Abstract

We initiate the study of trade-offs between sparsity and the number of measurements in sparse recovery
schemes for generic norms. Specifically, for a norm $\|\cdot\|$, sparsity parameter $k$, approximation
factor $K > 0$, and probability of failure $P > 0$, we ask: what is the minimal value of $m$ so that there is a
distribution over $m \times n$ matrices $A$ with the property that for any $x$, given $Ax$, we can recover a $k$-sparse
approximation to $x$ in the given norm with probability at least $1 - P$? We give a partial answer to this
problem, by showing that for norms that admit efficient linear sketches, the optimal number of measurements
$m$ is closely related to the doubling dimension of the metric induced by the norm $\|\cdot\|$ on the set of
all $k$-sparse vectors. By applying our result to specific norms, we cast known measurement bounds in our
general framework (for the $\ell_p$ norms, $p \in [1,2]$) as well as provide new, measurement-efficient schemes
(for the Earth-Mover Distance norm). The latter result directly implies more succinct linear sketches for
the well-studied planar $k$-median clustering problem. Finally, our lower bound for the doubling dimension
of the EMD norm enables us to address the open question of [Frahling-Sohler, STOC'05] about the
space complexity of clustering problems in the dynamic streaming model.


1 Introduction 


The field of sparse recovery studies the following question: for a signal $x$, when is it possible to compute an
approximation $\hat{x}$ to $x$ that is parameterized by only a small number of coefficients, given only a small number
of linear measurements of $x$? The answers to this basic question, i.e., the sparse recovery schemes, have
found a surprising number of applications in a broad spectrum of fields, including compressive sensing
[CRT06, Don06], data stream computing [Mut05] (see also the resources at sublinear.info) and Fourier
sampling [GIIS14].

A particularly useful and well-studied formalization of this question is that of stable sparse recovery. A
general formulation of the problem is as follows. For a norm $\|\cdot\|$, sparsity parameter $k$, probability of failure
$P$ and an approximation factor $K > 0$, design a distribution over $m \times n$ matrices $A$ which has the following
property:

There is an algorithm $\mathcal{A}$ that, for any $x$, given $Ax$, recovers a vector $\hat{x} = \mathcal{A}(Ax)$ such that

$$\|x - \hat{x}\| \le K \cdot \min_{k\text{-sparse } x'} \|x - x'\| \tag{1}$$

with probability at least $1 - P$.

Here we say that a vector is $k$-sparse if it has at most $k$ non-zero coordinates.^1 The typical choices of the norm $\|\cdot\|$
are either $\ell_1$ or $\ell_2$. However, several other variants have been studied as well: [BGI+08, AZGR15] studied
sparse recovery under general $\ell_p$ norms, [GIP10, IP11] considered the Earth-Mover-Distance (EMD)
norm, while [War14] considered rearrangement-invariant block norms.
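
For the $\ell_p$ norms, the minimum on the right-hand side of (1) is attained by keeping the $k$ largest-magnitude coordinates of $x$. The following is a minimal Python sketch of that benchmark, not part of the paper; the names `best_k_sparse` and `lp_norm` are illustrative assumptions.

```python
def lp_norm(x, p):
    """l_p (quasi-)norm of a list of numbers, for p > 0."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def best_k_sparse(x, k):
    """Best k-sparse approximation of x under any l_p norm:
    keep the k entries of largest magnitude and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0 for i, v in enumerate(x)]


x = [3, -1, 0, 4, 0.5]
x_hat = best_k_sparse(x, 2)
# Benchmark error that a recovery scheme must match up to the factor K.
error = lp_norm([a - b for a, b in zip(x, x_hat)], 1)
```

A recovery scheme satisfying (1) with factor $K$ must return a vector whose error is at most $K$ times `error`, while seeing only $Ax$.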

It is easy to observe that the number of measurements $m$ must depend on the sparsity parameter $k$: the
more information about the signal we want to acquire, the more measurements must be taken. For the $\ell_1$ and $\ell_2$
norms, the tradeoff between $m$ and $k$ is well-understood: it is known that $m = O(k \log(n/k))$ measurements
suffice [CRT06], and this bound is tight [Don06, DIPW10]. For other norms, however, our understanding
of the tradeoffs is much more limited.

1.1 Our results 

In this paper we initiate the study of sparsity-measurements trade-offs for generic norms.^2 Our results
generalize the previously known tradeoffs, and provide improved bounds for specific norms, notably EMD
and $\ell_p$ for $p \in (0,1)$. Further, our results for EMD immediately yield new sketching algorithms and new
lower bounds for the low-dimensional $k$-median clustering problem.

Our first result shows that, for norms that admit efficient linear sketches, the number of measurements
sufficient for sparse recovery is closely related to the doubling dimension of $k$-sparse vectors under that
norm. Formally, we prove the following theorem.

Theorem 1.1. Suppose that $X = (\mathbb{R}^n, \|\cdot\|)$ is an $n$-dimensional normed space and let $1 \le k \le n$ be the sparsity
parameter. Assume that, for some (distortion) parameter $D > 1$, there is a distribution over $s \times n$ random
(sketch) matrices $S$ and an (estimator) function $P : \mathbb{R}^s \times \mathbb{R}^s \to \mathbb{R}$ such that for any $x$ and a $k$-sparse $y$ we have

$$\Pr\bigl[\|x - y\| \le P(Sx, Sy) \le D\|x - y\|\bigr] \ge 2/3.$$

^1 Further generalizations of the problem can be obtained by allowing the sparsity in an arbitrary basis, or by allowing different
norms on the LHS and RHS of Equation (1). Although important, we will not consider these generalizations in this paper.

^2 In fact, our results hold even for quasi-norms, e.g., $\ell_p$ norms for $p < 1$ (see Preliminaries for more details). However, for the
sake of simplicity, in the rest of the paper we will mostly focus on norms.
Furthermore, let $d$ be the doubling dimension of the set of $k$-sparse vectors from $\mathbb{R}^n$ with respect to the
metric induced by $\|\cdot\|$. Then, for every $0 < \epsilon, \tau < 1/3$ there exists a distribution over random matrices
$A \in \mathbb{R}^{m \times n}$ with

$$m = O\bigl(s \cdot (d \cdot \log(D/\epsilon) + \log\log(1/\tau))\bigr)$$

such that for every $x \in \mathbb{R}^n$, given $Ax$, we can recover with probability at least 2/3 a vector $\hat{x} \in \mathbb{R}^n$ such that

$$\|x - \hat{x}\| \le (1 + \epsilon) D \cdot \min_{k\text{-sparse } x^*} \|x - x^*\| + \tau \|x\|. \tag{2}$$

To explain the theorem, we first observe that the guarantee given by (2) is analogous to the one given
by (1), with the exception of the extra additive term $\tau\|x\|$. The "precision parameter" $\tau$ can be made
arbitrarily small, at a price of increasing the number of measurements by an extra $\log\log(1/\tau)$ term. Similar
tradeoffs between the precision and the number of measurements are quite common in compressive sensing
schemes,^3 although we do not know whether this extra term is necessary in our setting. Apart from the
precision dependence, the number of measurements $m$ is linear in the doubling dimension $d$, linear in the
sketch length $s$ and logarithmic in the distortion $D$.

Our theorem requires that the normed space of interest admits efficient linear sketches. We believe that
some variant of this assumption is necessary for sparse recovery, as such sketches are needed if one wants
to estimate the approximation error, i.e., the RHS of Equation (2). However, this intuition does not lend itself
to a formal argument, as, e.g., for the $\ell_1$ norm there exist sparse recovery schemes [CRT06, Don06] that
satisfy Equation (2) without explicitly estimating the approximation error. Still, the $\ell_1$ norm supports efficient
sketches, which suggests that some form of sketchability of the norm could be a necessary condition.

To illustrate Theorem 1.1, consider the case of the $\ell_p$ (quasi-)norms for $p \in (0,2]$. It is known [Ind06]
that these norms allow sketches with distortion $D = 1 + \epsilon$ and dimension $s = O(1/\epsilon^2)$ for any $\epsilon > 0$, and
it is also immediate that the doubling dimension $d$ is $O(k \log(n/k))$. Therefore, for $p \in [1,2]$ our theorem
reproduces the known optimal $O(k \log(n/k))$ measurement bound, up to the dependence on the precision
parameter $\tau$. The same bound is obtained for $p \in (0,1)$. The latter result is, to the best of our knowledge,
new.
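
As a worked instance of this substitution, assume the accuracy parameter $\epsilon$ is fixed to a constant (an assumption made here only to suppress the $\epsilon$-dependence), so that both the sketch length and the distortion are constants:

```latex
\[
s = O(1), \quad D = O(1), \quad d = O(k \log(n/k))
\;\Longrightarrow\;
m = O\bigl(s \cdot (d \cdot \log(D/\epsilon) + \log\log(1/\tau))\bigr)
  = O\bigl(k \log(n/k) + \log\log(1/\tau)\bigr).
\]
```

The first term is the familiar $O(k \log(n/k))$ bound; the second is the price paid for the additive $\tau\|x\|$ term in (2).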

We note that Theorem 1.1 is not efficient: it does not provide a polynomial time algorithm for recovering
$\hat{x}$ from $Ax$. Given the generality of the setting, in particular, the fact that it allows a general (sketchable) norm
$\|\cdot\|$, we believe that a general polynomial time recovery algorithm is unlikely to exist. However, it is possible
that efficient algorithms exist for specific norms which have good computational properties. For example,
we show that for the case of the Earth-Mover Distance norm discussed in more detail below, the recovery
algorithm runs in time polynomial in $n$ and $\log^k n$. In particular, the running time is polynomial for any
constant $k$.


Lower bound. The $\ell_p$ norm example shows that the bound of Theorem 1.1 is tight for some norms. In
fact, one can show that the linear dependence on the doubling dimension $d$ is necessary for all norms whose
"aspect ratio" is bounded by a polynomial in $n$. In particular, we show the following theorem.

Theorem 1.2. Consider any norm $\|\cdot\|$ over $\mathbb{R}^n$ whose aspect ratio is at most $n^c$ for some constant $c$. Let $T_k \subseteq \mathbb{R}^n$
denote the set of $k$-sparse vectors and let $d \ge 1$ denote the doubling dimension of $[0,\infty)^n \cap T_k$ with $\|\cdot\|$. Then any
sparse recovery scheme for $[0,\infty)^n$ with approximation factor $K$ requires $m = \Omega(d / \log K)$ measurements.

Note that the theorem holds even for vectors $x \ge 0$, which will be useful in the context of the Earth-Mover
Distance.

^3 E.g., in most of the existing sparse Fourier transform algorithms the sample complexity depends logarithmically on the
precision parameter [GIIS14].



Randomized/Deterministic    Sketch length m          Approximation factor
Deterministic               k log n log(n/k)         O(1)
Deterministic               k log(n/k)               O(sqrt(log(n/k)))
Randomized                  k log(n/k)               O(1)

Figure 1: Performance of sparse recovery schemes for the EMD from [IP11]. The schemes assume that the
input vector $x$ is non-negative. Each result implies a sketching scheme for the $k$-median problem with the
same parameters.


Earth-Mover Distance. Our results have direct implications for sparse recovery over the Earth-Mover
Distance (EMD) norm. This norm is defined over $n$-dimensional vectors with $n = \Delta^2$, where such vectors
can be interpreted as functions $[\Delta]^2 \to \mathbb{R}$. Informally, for vectors $x, y : [\Delta]^2 \to \mathbb{R}_+$ which have the same
$\ell_1$ norm, the EMD is defined as the cost of the min-cost flow that transforms $x$ into $y$, where the cost of
transporting a "unit" of mass from a point $p \in [\Delta]^2$ of $x$ to a point $q \in [\Delta]^2$ of $y$ is equal to the $\ell_1$ distance^4
between $p$ and $q$. See Preliminaries for a formal definition.

Earth-Mover Distance and its variants are popular metrics for estimating similarity between images and
feature sets [RTG00, GD05]. Furthermore, the $k$-sparse approximation of non-negative vectors under the
EMD norm has the following natural interpretation. Let $\hat{x}$ be the $k$-sparse vector closest to $x$ under this norm.
Then one can observe that the non-zero entries of $\hat{x}$ correspond to the cluster centers in the best $k$-median^5
clustering of $x$. Thus, sparse recovery schemes for the EMD norm provide methods for recovering
near-optimal solutions to the planar $k$-median problem from few linear measurements of the input point-sets, a
problem that has attracted considerable attention in the streaming and sketching literature [Ind04, FS05, IP11].
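
To make the clustering objective concrete, here is a small brute-force Python sketch of the planar $k$-median problem on a tiny grid. It is an illustration only, not one of the schemes discussed in this paper; the function names and the example weights are assumptions, and the Euclidean ground distance follows the footnoted definition of the objective.

```python
import math
from itertools import combinations


def k_median_cost(weights, centers):
    """Sum over weighted pixels of weight * distance to the nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in weights.items())


def brute_force_k_median(weights, grid_side, k):
    """Try every set of k grid points as centers (feasible only for tiny grids)."""
    grid = [(i, j) for i in range(grid_side) for j in range(grid_side)]
    return min(combinations(grid, k), key=lambda c: k_median_cost(weights, c))


# A non-negative vector x on a 4 x 4 grid, stored as {pixel: weight}.
x = {(0, 0): 1, (0, 1): 1, (3, 3): 2}
centers = brute_force_k_median(x, grid_side=4, k=2)
cost = k_median_cost(x, centers)
```

The enumeration takes time exponential in $k$, which is why sketching and sparse recovery approaches aim to recover near-optimal centers from a small number of linear measurements instead.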

The state of the art schemes for this problem are listed in Figure 1. In particular, the best known bound
for the number of measurements is $O(k \log(n/k))$, which mimics the best possible bound achievable for
sparse recovery in the $\ell_1$ norm.

We show that Theorem 1.1 provides new results for this problem. Specifically, we show that the doubling
dimension of the EMD norm over $k$-sparse vectors is only $O(k \log\log n)$. Combined with the known fact
that the EMD norm can be embedded into $\ell_1$ with distortion $O(\log n)$ [Cha02, IT03] (and therefore its
sketching complexity $s$ is constant), this implies that there exists a sparse recovery scheme for EMD with
approximation factor $O(\log n)$ that uses only $O(k (\log\log n)^2)$ measurements (ignoring the dependence on
the precision). The running time of the recovery procedure is polynomial in $\Delta$ and $\log^k \Delta$ (again ignoring the
dependence on the precision), which is polynomial in $\Delta$ for any $k$ up to $\log \Delta / \log\log \Delta$. We further show
that the result can be strengthened in three ways:

• By performing a more careful analysis of the embedding procedure of [IT03], we show that it in
fact incurs a distortion of $O(\log k + \log\log n)$ with constant probability, which is sufficient for our
purposes.

• By using a variant of the embedding (given in [Ind07]) and combining it with a sketch of [VZ12], we
show the distortion can be reduced further to $O(\log k)$ while increasing the sketch length by a factor
of $O(\log^{\delta} n)$ for any constant $\delta > 0$. Note that in the case of constant $k$, the approximation we obtain
is constant as well.

• Finally, we consider vectors $x$ with the property that, for some integer $N$, all entries $x_p$ are multiples
of $1/N$ (in this case we say that $x$ has granularity $1/N$). Such vectors correspond to characteristic
vectors of multisets of size $N$, and naturally occur in the unweighted $k$-median problem over point
sets of size $N$. In this case we show that, in the bounds for the doubling dimension and distortion, we
can replace $\log\log n$ by $\log\log N$. Notably, the bounds we obtain in this case are independent of the
ambient dimension $n$.

^4 One can also use the $\ell_2$ distance. Note that the two distances differ by at most a factor of $\sqrt{2}$ for two-dimensional images.

^5 For completeness, in our context the $k$-median clustering problem is defined as follows. First, each pixel $p \in [\Delta]^2$ is interpreted
as a point with weight $x_p$. Then the goal is to find a set $C \subseteq [\Delta]^2$ of $k$ "medians" that minimizes the objective function
$\sum_{p \in [\Delta]^2} x_p \cdot \min_{c \in C} \|p - c\|_2$.

By combining these bounds with Theorem 1.1 we obtain sparse recovery schemes for EMD with the
guarantees as in Figure 2 (see also Section|E] for the formal statement of the results).
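
As a worked check of how the first of these guarantees arises, one can substitute the doubling dimension bound and the improved distortion into the measurement bound of Theorem 1.1, treating $\epsilon$ as a constant (an assumption made here to suppress the $\epsilon$-dependence):

```latex
\[
s = O(1), \quad d = O(k \log\log n), \quad D = O(\log k + \log\log n)
\;\Longrightarrow\;
m = O\bigl(s \cdot (d \cdot \log(D/\epsilon) + \log\log(1/\tau))\bigr)
  = O\bigl(k \,(\log\log n)\, \log(\log k + \log\log n) + \log\log(1/\tau)\bigr).
\]
```

The approximation factor is $(1+\epsilon)D = O(\log k + \log\log n)$, which is the distortion of the underlying sketch.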


Randomized/Deterministic    Sketch length m                                                  Approximation factor
Randomized                  k (log log n)(log(log k + log log n)) + log log(1/τ)           O(log k + log log n)
Randomized                  k log^δ n + log log(1/τ)                                        O(log k)
Randomized, lower bound     Ω(k (log log(n/k)) / log K)                                     K > 2

Figure 2: Performance of our sparse recovery schemes for the EMD. The schemes assume that the input
vector $x$ is non-negative. The first two results imply a sketching scheme for the $k$-median problem with the
same parameters.


The aforementioned bounds are quite surprising, as they are provably impossible to achieve for the $\ell_1$ or
$\ell_2$ norms. In particular, for $\ell_1$ and $\ell_2$ one needs $\Omega(\log n)$ measurements to achieve a constant approximation
factor even for $k = 1$, and $\Omega(\log n / \log\log n)$ measurements to achieve $O(\log n)$ distortion [DIPW10]. This
means that the EMD norm is actually easier than the $\ell_p$ norms from the sparse recovery perspective, at least in
a range of parameters.

We also show that at least one $\log\log n$ factor in the measurement bound is necessary as long as $k \ge 2$, by
proving a lower bound for the doubling dimension of $k$-sparse vectors under EMD and using Theorem 1.2. In
fact, our lower bound argument applies almost verbatim to the space complexity of the following data stream
problem: design a data structure that maintains a vector $x$ under increments and decrements of its coordinates
which, when queried, reports a $k$-sparse approximation to $x$ with approximation factor $K$ with constant
probability. As discussed earlier, in the context of the EMD norm this task corresponds to the problem of
maintaining a $k$-median clustering of a dynamic point set where points can be inserted and deleted (i.e.,
the coordinates of $x$ can be incremented and decremented). As we show in Theorem B.3, the space bit
complexity of this problem is $\Omega\bigl(\frac{d}{\log K} \log n\bigr)$ for general norms, and thus, in particular,
$\Omega\bigl(\frac{k}{\log K} \log\log(n/k) \cdot \log \Delta\bigr)$ for the EMD norm. The last bound addresses the open question of [FS05] (Section 7), who asked
whether it is possible to maintain a constant size (for fixed $k$ and $K$) "core-set"^6 for the $k$-median and
$k$-means problems in dynamic data streams. Although our arguments do not consider the core-set size per se,
we do show that any algorithm that solves $k$-median and $k$-means^7 in the dynamic data stream model must
use a super-constant number of words of size $\log \Delta$, even for constant $k$ and $K$.

Finally, we show that for the case of $k = 1$, a sparse recovery scheme exists with $O(1)$ measurements
for constant $d$ and $\epsilon$, independent of $n$. This is again in sharp contrast to the $\ell_1$ or $\ell_2$ norms, as well as the
aforementioned case of $k \ge 2$.

1.2 Our techniques and related work 

Our upper bound for the number of measurements relies on the connection between sparse recovery and
approximate nearest neighbor search. Specifically, our goal can be phrased as finding the nearest neighbor
of $x$ in a set of bounded doubling dimension. The latter problem can be solved using the navigating nets data
structure [KL04], and indeed we are using a similar top-down search approach in our algorithm. However,
we need to deal with complications that arise due to the fact that in our setting we can only estimate distances
approximately and with a certain probability. Specifically, to obtain the desired bound, we need to ensure
that the total number of distances that our sketch needs to preserve is only linear in the depth of the tree.
This allows us to bound the probability of failure of the algorithm by taking the union bound over a small
number of events. It is easy to observe, however, that the path in the tree taken by the search algorithm is
adaptive, i.e., the approximation errors incurred by the sketch at one level affect the points considered by
the algorithm at the next level. Nevertheless we show that the path cannot be too adaptive, and that one can
identify a set of points of size linear in the tree depth so that preserving all the distances from those points
to $x$ ensures the correctness of the algorithm. The details are in Section 3.

^6 Informally, a core-set for the $k$-median problem over a set of points $P$ is a weighted subset $C \subseteq P$ such that a solution to $C$
provides an approximate solution to $P$. Core-sets provide a tool for solving the $k$-means and $k$-median problems
in data streams. See [FS05] for more details.

^7 The lower bound for the $k$-means problem is presented in Appendix IgI

Our lower bound builds on the argument from [DIPW10], where the number of measurements was lower
bounded by encoding long bit sequences into the signal $x$, such that those bits could be unambiguously
decoded by the sparse recovery algorithm. The encoding proceeded on several distance scales. At each
scale, the encoding used a large set of almost equidistant $k$-sparse vectors as the "dictionary". Since the
maximum size of such sets is directly related to the doubling dimension of the space, the lower bound
argument goes through in the setting of a general norm. The details are in Section B.

The doubling dimension of the set of $k$-sparse vectors under EMD was previously studied by [GKK10],
who showed that it is at most $O(k \log k)$ for the special case of measures induced by $k$-sets, i.e., measures of
granularity $1/k$. For this case it is in fact not difficult to improve the bound to $O(k)$, and we give an outline
of the improved argument in Section 3.1. However, for our applications we need a bound that holds for
general measures. This makes the argument more complex, since we need to deal with general flows. In
both cases the idea of the proof is to explicitly construct a covering of a ball of radius $R$ using a small number
of balls of radius $R/2$, by using the geometric and combinatorial properties of planar flows. The details are
in Section 3.1 (for general non-negative vectors) and Section|0 (for vectors of bounded granularity).

Our improved analysis of the embedding of [IT03], as well as the analysis of the embedding from [Ind07],
utilize the fact that our application allows us to relax the standard embedding definition in two ways. First,
we only need to preserve the distance between a $k$-sparse vector and a general vector, as opposed to between
any pair of vectors (see the statement of Theorem 1.1 for the precise guarantee that we are after). Second,
we only need to ensure that the distances are preserved with constant probability, not in expectation, which
means that we can tolerate events that incur high distortion as long as they occur with low enough
probability. Combining the two relaxations^8 with a more careful analysis allows us to achieve the improved bound,
surprisingly almost without any modifications to the embeddings themselves. The details are in Section D.

We note that if one wants to preserve EMD between two vectors that are both $k$-sparse, then one can
embed those vectors into $\ell_1$ with distortion $O(\log k)$ [BI14], which yields a sketch with the same distortion
and constant size [Ind06]. Also, for the case when one of the vectors is $k$-sparse, a recent work [YO14] shows
a sketch with distortion O(min(k^, log n)) and size roughly O(log^ n). The sketch in this paper substantially
improves over the latter bound.

For the 1-median problem we solve an $\ell_1$-regression problem. We give oblivious sketches that provide
subspace embeddings for the $\ell_1$-norm for $d$-dimensional subspaces with a "disjoint basis" property that
arises in this setting. Our embedding works when the basis is expressible as the union of a small number of
sets of vectors, where in each set the vectors have disjoint support. Unlike existing oblivious embeddings for
$\ell_1$ [CDM+13, MM13, SW11, WZ13], we obtain $(1 + \epsilon)$ instead of $\mathrm{poly}(d)$ distortion, and low $\delta$ instead of
constant probability of failure (to simultaneously preserve norms of all vectors in the space). Our embedding
maps $n$-dimensional vectors to $O(d/\epsilon^2 \cdot \log(d/(\delta\epsilon)))$ dimensions. We overcome non-embeddability results
for $\ell_1$ [BC05, CS02] by using a non-convex estimator. This is reminiscent of estimators for data streams
[Ind06], but complicated here by the fact that we require the stronger notion of a subspace embedding. It is
known (see, e.g., [ABS10]) that for constant $d$ and $\epsilon$ one can solve the 1-median by taking $O(1)$ samples and
solving the problem on the samples, but this cannot be expressed as a linear sketch with fewer than $\Omega(\log n)$
measurements (the sampling lower bound follows from Theorem 8 of [JST11]), whereas we achieve $O(1)$
measurements. The details are in Section |B

^8 It can be seen that both relaxations are needed in order to achieve the better bound. In particular, the expected distortion of
the embedding is $\Theta(\log n)$, even for a pair of 1-sparse vectors. Similarly, if $k = n$, the distortion of the embedding is $\Omega(\log n)$ with
probability $1 - o(1)$.

2 Preliminaries 

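EMD. We start by defining EMD. Consider any two non-negative vectors $x, y : [\Delta]^d \to \mathbb{R}_+$ such that
$\|x\|_1 = \|y\|_1$. Let $\Gamma(x,y)$ be the set of functions $\gamma : [\Delta]^d \times [\Delta]^d \to \mathbb{R}_+$ such that for any $i, j \in [\Delta]^d$ we have
$\sum_l \gamma(i,l) = x_i$ and $\sum_l \gamma(l,j) = y_j$. Then we define

$$\mathrm{EMD}^*(x,y) = \min_{\gamma \in \Gamma(x,y)} \sum_{i,j \in [\Delta]^d} \gamma(i,j) \, \|i - j\|.$$

Note that if $x$ and $y$ are characteristic vectors of some sets $A, B \subseteq [\Delta]^d$, then $\mathrm{EMD}^*(x,y)$ is equal to the value
of the minimum cost matching between $A$ and $B$.

For the case of general vectors $x, y$, we define

$$\mathrm{EMD}(x,y) = \min_{\substack{x' \le x,\; y' \le y \\ \|x'\|_1 = \|y'\|_1}} \mathrm{EMD}^*(x',y') + D\bigl[\|x - x'\|_1 + \|y - y'\|_1\bigr],$$

where $D = d\Delta$ is the diameter of the set $[\Delta]^d$.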
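
The matching interpretation can be checked directly on tiny inputs. The following Python sketch computes the value of the minimum cost matching between two equal-size point sets by brute force; it is an illustration only, the function names are assumptions, and the $\ell_1$ ground distance is chosen to match the informal definition in the introduction.

```python
from itertools import permutations


def l1_distance(p, q):
    """l_1 distance between two grid points given as coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))


def emd_star_for_sets(A, B):
    """Value of the minimum cost perfect matching between equal-size
    point sets A and B, by trying every bijection (tiny inputs only)."""
    if len(A) != len(B):
        raise ValueError("characteristic vectors must have equal l_1 norm")
    return min(sum(l1_distance(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))


A = [(0, 0), (2, 2)]
B = [(0, 1), (2, 0)]
value = emd_star_for_sets(A, B)  # matches (0,0)-(0,1) and (2,2)-(2,0)
```

For general non-negative vectors the optimum is a fractional flow rather than a matching, so a linear-programming or min-cost-flow solver would be needed in place of the enumeration.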

Metric spaces. For a metric space $(X, d_X)$, we define $B_X(u,r)$ or, equivalently, $\mathrm{Ball}_X(u,r)$ to be the ball
centered at $u$ of radius $r$ containing all points from $X$ within distance $r$ from $u$: $B_X(u,r) := \{x \in X : d_X(u,x) \le r\}$.
Further, for a metric space $(X, d_X)$, the doubling dimension is the smallest number $d$ such that, for every
$r > 0$ and any $x \in X$, we can choose $x_1, x_2, \ldots, x_{2^d} \in X$ with

$$B_X(x,r) \subseteq B_X(x_1, r/2) \cup B_X(x_2, r/2) \cup \cdots \cup B_X(x_{2^d}, r/2).$$

Finally, for $K \ge 1$ we define a $K$-quasi-metric space as a variant of a metric space, where we have the
following relaxed triangle inequality: $d(x,y) \le K \cdot (d(x,z) + d(z,y))$. Thus, every metric space is a
1-quasi-metric space. We define $K$-quasi-norms in an analogous way.

3 Upper Bound on Measurement Complexity 

Suppose we have a $K$-quasi-metric space $M = (X, \rho)$ and a closed subset $Y \subseteq X$ with doubling dimension
$d$. Let us assume we can sketch distances between points from $X$ and $Y$ with distortion $D$, sketch size $s$ and
success probability at least 2/3 (see Theorem 1.1 for the formal definition).

The following lemma builds on a result from [KL04] on approximate nearest neighbor search in
doubling spaces.

Lemma 3.1. For every $0 < \epsilon < 1/2$, $0 < \lambda < \Delta$ and $y_0 \in Y$ one can sketch points of $X$ with sketch size

$$O\bigl(s \cdot (d \log(DK/\epsilon) + \log\log(\Delta/\lambda))\bigr)$$

so that from this sketch for $x$ with $\rho(x, y_0) \le \Delta$ we can recover with probability at least 2/3 a point
$y \in Y$ such that

$$\rho(x,y) \le \max\bigl((1 + \epsilon) D K \cdot \rho(x, Y), \lambda\bigr). \tag{3}$$

Search Procedure:

    y_0 ← the only element of N_0
    for i ← 1 ... L do
        S_i ← N_i ∩ B_X(y_{i-1}, β r_i)
        y_i ← argmin_{y ∈ S_i} q(y)
        if q(y_i) > γ r_i then
            return y_{i-1}
    return y_L

Proof. First, we describe the recovery procedure and then show how to sketch points. For now, we assume
that for the point of interest $x \in X$ and for every $y \in Y$ we know a number $q(y)$ such that

$$\rho(x,y) \le q(y) \le D \cdot \rho(x,y). \tag{4}$$

The recovery procedure we describe has several parameters: a positive integer $L$, a real $0 < \alpha < 1$ and
reals $\beta$ and $\gamma$. For reasons that will be clear later we require that

■ [a + ly) <a^. (5) 


The recovery procedure is as follows. First, for every $0 \le i \le L$ we build an $r_i$-net $N_i$ of $Y \cap B_X(y_0, bf)$,
where $r_i = 2\alpha^i \Delta$, such that all pairs of points from $N_i$ have pairwise distances larger than $r_i$. In particular,
$|N_0| = 1$, and for every $i$ the size of $N_i$ is finite since the doubling dimension $d$ of $Y$ is finite. Such a net
can be found using a straightforward greedy algorithm. Second, given a point $x \in X$ with $\rho(x, y_0) \le \Delta$ we
recover an approximate nearest neighbor from $Y$ using the Search Procedure shown above.

Now let us analyze this procedure. Denote by $y^* = \arg\min_{y \in Y} \rho(x,y)$ one of the nearest neighbors of $x$
from $Y$ (note that $y^*$ exists, since $Y$ is assumed to be closed). The proof follows from the following three
claims (the proofs are in Appendix A).
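
The net construction and the top-down search can be sketched in Python as follows. This is a simplified illustration rather than the procedure analyzed in the proof: it assumes an ordinary metric ($K = 1$) and exact distances ($q(y) = \rho(x,y)$, i.e. $D = 1$), it builds the nets over all of $Y$ rather than over a ball around $y_0$, and the parameter values in the example are arbitrary choices, not the ones set later in the proof.

```python
def greedy_net(points, r, dist):
    """Greedy r-net: chosen centers are pairwise more than r apart and
    every input point is within distance r of some center."""
    net = []
    for p in points:
        if all(dist(p, c) > r for c in net):
            net.append(p)
    return net


def top_down_search(x, Y, y0, delta, L, alpha, beta, gamma, dist):
    """Descend through nets of radius r_i = 2 * alpha**i * delta, keeping
    only net points within beta * r_i of the previous candidate."""
    q = lambda y: dist(x, y)          # stand-in for the sketched estimate q(y)
    prev = y0
    for i in range(1, L + 1):
        r_i = 2 * alpha ** i * delta
        N_i = greedy_net(Y, r_i, dist)
        S_i = [y for y in N_i if dist(prev, y) <= beta * r_i]
        if not S_i:                   # defensive guard, not part of the paper's procedure
            return prev
        y_i = min(S_i, key=q)
        if q(y_i) > gamma * r_i:      # x is far from Y at this scale: stop early
            return prev
        prev = y_i
    return prev


line = list(range(0, 101))            # Y: integer points on a line
metric = lambda a, b: abs(a - b)
found = top_down_search(37.3, line, y0=0, delta=100, L=8,
                        alpha=0.5, beta=4, gamma=4, dist=metric)
```

Each level inspects only the net points near the previous candidate, which is what keeps the number of distance estimates per level bounded in terms of the doubling dimension.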

Claim 3.2. If (5) holds and for some $1 \le i \le L$ one has $q(y_{i-1}) \le \gamma r_{i-1}$, then $\rho(y^*, S_i) \le r_i$.

Now let us analyze the case when the algorithm returns $y_{i-1}$ for some $1 \le i \le L$.

Claim 3.3. If (5) holds and the algorithm returns $y_{i-1}$ for some $1 \le i \le L$, then

$$\frac{\rho(x, y_{i-1})}{\rho(x, y^*)} \le \frac{DK\gamma}{\alpha(\gamma - DK)}.$$

Next, suppose that our algorithm returns $y_L$.

Claim 3.4. If (5) holds and the algorithm returns $y_L$, then $\rho(x, y_L) \le 2\gamma\alpha^L\Delta$.

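Let us now show how to set $L$, $\alpha$, $\beta$ and $\gamma$. Claims 3.3 and 3.4 imply that in order to satisfy (3) we need
to satisfy, together with (5), the following conditions:

$$\frac{DK\gamma}{\alpha(\gamma - DK)} \le (1 + \epsilon) DK, \tag{6}$$

$$2\gamma\alpha^L\Delta \le \lambda. \tag{7}$$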
(V) 
It is immediate to see that we can satisfy (5), (6) and (7) simultaneously by setting $\alpha = 1 - \Theta(\epsilon)$,
$\beta = \Theta(DK^2/\epsilon)$, $\gamma = \Theta(DK/\epsilon)$ and $L = O\bigl(\frac{1}{\epsilon} \log\frac{\gamma\Delta}{\lambda}\bigr)$.

So far we assumed that we have access to a function $q(\cdot)$ that satisfies (4). In reality we build such a
function from sketches of distances between points from $X$ and $Y$. Suppose we can build a subset^9 $Q \subseteq Y$
with $|Q| \le N$ such that for a given $x \in X$ the recovery procedure can query $q(y)$ only for $y \in Q$. Then, we
can use the standard amplification argument for the median estimator, and sketch in size $O(s \log N)$ to get a
randomized function $q'(\cdot)$ such that for every $y \in Q$ one has $\Pr[\rho(x,y) \le q'(y) \le D \cdot \rho(x,y)] \ge 1 - \frac{1}{3N}$. Now
we use $q'(\cdot)$ for the recovery, and by the union bound the recovery algorithm succeeds with probability at
least 2/3. It is only left to upper bound $N$ for an appropriately chosen set $Q$.
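
The amplification step rests on a simple fact: if more than half of a collection of independent estimates fall inside the target interval $[\rho(x,y), D \cdot \rho(x,y)]$, then so does their median. A minimal Python illustration (the function name and the sample numbers are assumptions made for this example):

```python
import statistics


def amplified_estimate(estimates):
    """Median of independent repetitions of a distance estimator.
    If more than half of the repetitions are accurate, so is the median."""
    return statistics.median(estimates)


# True distance 10 with distortion D = 1.2: accurate estimates lie in [10, 12].
# Three of the five repetitions are accurate; two are wild outliers.
repetitions = [10.2, 11.0, 500.0, 0.1, 10.9]
estimate = amplified_estimate(repetitions)
```

With $O(\log N)$ repetitions of a sketch that succeeds with probability 2/3, a Chernoff bound makes the event "at most half are accurate" unlikely enough for the union bound over $N$ queries, which is where the $O(s \log N)$ sketch size comes from.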

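It is clear that we query $q(\cdot)$ for points only from

$$\bigcup_{\substack{i \in [L] \\ q(y_{i-1}) \le \gamma r_{i-1}}} S_i \;\subseteq\; \bigcup_{\substack{i \in [L] \\ \rho(x, y_{i-1}) \le \gamma r_{i-1}}} S_i$$

(the inclusion is by (4)). By Claim 3.2 the right-hand side is included in

$$Q = \bigcup_{1 \le i \le L} \bigl(N_i \cap B_X(y^*, K^2 \cdot (1 + 2\beta) \cdot r_i)\bigr).$$

Since $Y$ has doubling dimension $d$ and points from $N_i$ are $r_i$-separated, we get $N = |Q| \le L \cdot (K^2 \cdot (1 + 2\beta))^{O(d)}$.
Using the values of $L$, $\alpha$, $\beta$ and $\gamma$, we get that the final sketch size is

$$O(s \log N) \le O\bigl(s \cdot (d \log(K \cdot (1 + \beta)) + \log L)\bigr) \le O\bigl(s \cdot (d \log(DK/\epsilon) + \log\log(\Delta/\lambda))\bigr). \qquad \square$$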

Corollary 3.5. Suppose that $X$ is induced by a norm of dimension $n$, and that there is an algorithm that
computes the sets $S_i$ defined by the search procedure in time Then the search procedure runs in time
polynomial in $N = L \cdot (K^2 \cdot (1 + 2\beta))^{O(d)}$ and $n$.


3.1 Upper bound on the doubling dimension of EMD 

We will prove that the doubling dimension of $k$-sparse probability measures over $[\Delta]^2$ equipped with EMD is
$O(k \log\log \Delta)$. For a weaker and simpler bound $O(k \log k)$ on the doubling dimension in the case of $k$-sparse
subsets, see [GKK10]. In fact, it is not hard to prove an upper bound of $O(k)$ on the doubling dimension for
$k$-sparse subsets. Notice that in this case the upper bound on the doubling dimension does not depend on the
size of the grid. We will now provide an intuition for why the upper bound $O(k)$ holds.

We have an EMD ball $\mathrm{Ball}_{\mathrm{EMD}}(\mu, R)$ of radius $R$ centered at a $k$-sparse measure $\mu$ such that $\mu(x,y) = 1$
for all $(x,y) \in \mathrm{supp}(\mu)$. (We can think of $\mu$ as a $k$-sparse set.) We would like to cover all $k$-sparse
subsets within $\mathrm{Ball}_{\mathrm{EMD}}(\mu, R)$ with EMD balls of radius $R/2$ centered at $k$-sparse subsets.

First, let us show how to cover all subsets $\sigma \in \mathrm{Ball}_{\mathrm{EMD}}(\mu, R)$ with $\sigma$ satisfying $\|\sigma_i - \mu_i\|_1 = \Theta(R/k)$ for
all $i \in [k]$. Here $\sigma_i$ and $\mu_i$ denote points in $\mathrm{supp}(\sigma)$ and $\mathrm{supp}(\mu)$ that get matched together in the optimal
transportation between $\sigma$ and $\mu$. For this, we take an $R/(100k)$-net of $\mathrm{Ball}_{\ell_1}(\mu_i, 10R/k)$ for every $i \in [k]$. Every
such net is of size $O(1)$. To cover all the $\sigma \in \mathrm{Ball}_{\mathrm{EMD}}(\mu, R)$, we need to take a representative from the
net of $\mathrm{Ball}_{\ell_1}(\mu_i, 10R/k)$ for all $i \in [k]$ and combine the representatives into $k$-sparse subsets. There are $2^{O(k)}$
possible ways to construct subsets by taking representatives.

In the case when we do not have the guarantee mentioned at the beginning of the previous paragraph,
we can guess the values $\|\sigma_i - \mu_i\|_1$ up to a constant factor and construct covers for all guesses. We need to show
that it is enough to take at most $2^{O(k)}$ guesses. This can indeed be shown by noticing that we do not need
to cover balls of very small radius (when $\|\sigma_i - \mu_i\|_1$ is small).

We proceed by showing an upper bound on the doubling dimension when we consider arbitrary measures
with support of size at most $k$ living in a square of side length $\Delta$.

^9 Note that $Q$ is more than just a single path from the "root" to the solution, as the behavior of the algorithm is not deterministic
and depends on the random bits chosen by the sketching procedure.

Lemma 3.6. The doubling dimension of the set of $k$-sparse probability measures over $[\Delta]^2$ under the EMD
metric is $O(k \log\log \Delta)$.

Proof. Let $\mu$ be a $k$-sparse probability measure over $[\Delta]^2$ and let $R > 0$ be some real number. Our goal is
to cover $B_{\mathrm{EMD}}(\mu, R)$ with $\log^{O(k)} \Delta$ EMD-balls centered in $k$-sparse measures and of radius $R/2$. In order
to achieve this it is sufficient to cover $B_{\mathrm{EMD}}(\mu, R)$ with $\log^{O(k)} \Delta$ EMD-balls centered in arbitrary measures
and of radius $R/4$.

The pseudocode in Figure 3 builds a set of measures $\mathcal{M}$ that serve as centers of balls with radius $R/4$
that together cover $B_{\mathrm{EMD}}(\mu, R)$. Roughly speaking, we first guess the topology of the optimal flow. Then we
guess the lengths of the corresponding edges. Then we guess the support. And finally we guess the masses
transported over the edges.

We assume that BUILDNET$(p, r)$ returns an $(r/100)$-net of $B_{\ell_1}(p, r) \cap [\Delta]^2$. It is immediate that
$|\mathcal{M}| \le \log^{O(k)} \Delta$, and that the running time of the above procedure is also $\log^{O(k)} \Delta$. It is left to show that for every
$k$-sparse $\mu'$ such that $\mathrm{EMD}(\mu, \mu') \le R$, there exists $\mu'' \in \mathcal{M}$ with $\mathrm{EMD}(\mu', \mu'') \le R/4$.

    M ← ∅
    m_0 ← R/(100Δk)
    for c : supp μ → Z_{≥0} such that Σ_{(x,y) ∈ supp μ} c(x,y) ≤ 2k do
        I ← {(x,y,i) | (x,y) ∈ supp μ, 1 ≤ i ≤ c(x,y)}
        for l : I → {1, 1.01, 1.01^2, ..., 2Δ} do
            for (x,y,i) ∈ I and for all p(x,y,i) ∈ BUILDNET((x,y), l(x,y,i)) do
                for m : I → {0, m_0, 1.01 · m_0, 1.01^2 · m_0, ..., min(1,R)} do
                    if for every (x,y) ∈ supp μ we have Σ_{i : (x,y,i) ∈ I} m(x,y,i) ≤ μ(x,y) then
                        let μ' be a measure over [Δ]^2 that is identically zero
                        for (x,y) ∈ supp μ do
                            s ← 0
                            for i : (x,y,i) ∈ I do
                                s ← s + m(x,y,i)
                                μ'(p(x,y,i)) ← μ'(p(x,y,i)) + m(x,y,i)
                            μ'(x,y) ← μ'(x,y) + μ(x,y) − s
                        M ← M ∪ {μ'}

Figure 3: Pseudocode for net construction

Claim 3.7. There exists an optimal flow between $\mu$ and $\mu'$ that is supported on at most $2k$ pairs of points.

Proof. Consider an optimal flow from $\mu$ to $\mu'$. Consider an undirected graph $G = (V, E)$ with $V = \mathrm{supp}(\mu) \cup
\mathrm{supp}(\mu')$. We connect two vertices $(x,y) \in \mathrm{supp}(\mu)$ and $(x',y') \in \mathrm{supp}(\mu')$ iff there is a non-zero amount flowing
from $(x,y)$ to $(x',y')$ in the flow.

If $|E| \ge 2k + 1$, then there is a cycle $e_1, e_2, \ldots, e_{2m} \in E$ of even length $2m$ in the graph $G$. W.l.o.g. assume
that the total length of $e_i$ with even $i$ is at most the total length of $e_i$ with odd $i$. Let us increase all flows
over $e_i$ with even $i$ and decrease all flows over $e_i$ with odd $i$ by the same amount such that at least one edge
carries zero flow. Clearly, the total cost can only decrease.
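
The cancellation step just described can be sketched in code as follows. This is only an illustrative sketch under stated assumptions: the flow is given as a dictionary from (source, destination) pairs to positive amounts, `cost` holds the edge lengths, and the helper names `find_cycle` and `cancel_cycles` are hypothetical.

```python
from collections import defaultdict

def find_cycle(flow):
    """Return the edges (u, v) of one cycle in the bipartite support graph, or None."""
    adj = defaultdict(list)
    for (u, v) in flow:
        a, b = ("src", u), ("dst", v)
        adj[a].append(b)
        adj[b].append(a)
    parent = {}

    def dfs(node, par):
        parent[node] = par
        for nxt in adj[node]:
            if nxt == par:
                continue
            if nxt in parent:
                # nxt is an ancestor of node: walk up the DFS tree to close the cycle
                path = [node]
                while path[-1] != nxt:
                    path.append(parent[path[-1]])
                return path
            found = dfs(nxt, node)
            if found is not None:
                return found
        return None

    for start in list(adj):
        if start not in parent:
            path = dfs(start, None)
            if path is not None:
                edges = []
                for a, b in zip(path, path[1:] + path[:1]):
                    src, dst = (a, b) if a[0] == "src" else (b, a)
                    edges.append((src[1], dst[1]))
                return edges
    return None

def cancel_cycles(flow, cost, tol=1e-12):
    """Repeatedly shift flow around cycles until the support graph is a forest."""
    flow = {e: f for e, f in flow.items() if f > tol}
    while True:
        cycle = find_cycle(flow)
        if cycle is None:
            return flow
        even, odd = cycle[0::2], cycle[1::2]
        # make 'even' the alternating class with the smaller total length
        if sum(cost[e] for e in even) > sum(cost[e] for e in odd):
            even, odd = odd, even
        eps = min(flow[e] for e in odd)
        for e in even:
            flow[e] += eps
        for e in odd:
            flow[e] -= eps
        flow = {e: f for e, f in flow.items() if f > tol}
```

Every vertex on the cycle touches one edge from each alternating class, so the marginals are preserved, and each round removes at least one edge from the support.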

Repeating the above process several times, we arrive at a flow supported on at most $2k$ edges. □

Thus, in our enumeration algorithm at least one $c(x,y)$ corresponds to the number of outgoing flow edges
from $(x,y) \in \mathrm{supp}\,\mu$ (Line 6). When we enumerate $l$ there is at least one choice that guesses all the lengths
of the corresponding edges within a multiplicative factor of 1.01 (Line 8). Thus, there exists a measure $\tilde\mu$
(not necessarily $k$-sparse) such that

• $\mathrm{supp}\,\tilde\mu \subseteq \mathrm{supp}\,\mu \cup \{p(x,y,i)\}$;

• $\mathrm{EMD}(\mu', \tilde\mu) \le R/50$;

• there exists a flow between $\mu$ and $\tilde\mu$ of cost at most $1.02 \cdot R$ that transports mass from a point $(x,y)$ to
$\{(x,y)\} \cup \{p(x,y,i)\}_i \subseteq \mathrm{supp}\,\tilde\mu$.

We have that $\tilde\mu$ covers $\mu'$, i.e., that $\mathrm{EMD}(\mu', \tilde\mu) \le R/100$, and that the procedure in the pseudocode
will guess $\mathrm{supp}(\tilde\mu)$ but not necessarily $\tilde\mu$. After guessing the support (Line 9), the pseudocode proceeds by
trying to guess the measure at the support (Line 10). We will show that there will be a guess $\mu''$ made by the
pseudocode with $\mathrm{supp}(\mu'') \subseteq \mathrm{supp}(\tilde\mu)$ that satisfies $\mathrm{EMD}(\tilde\mu, \mu'') \le 2R/50$.

Fix $(x,y) \in \mathrm{supp}\,\mu$. We show how to deal with the multi-set $\{(x,y)\} \cup \{p(x,y,i) : (x,y,i) \in I\}$. We round down
the mass in $\tilde\mu$ at the coordinates $\{p(x,y,i)\}$ to the closest element of $\{0, m_0, 1.01 \cdot m_0, 1.01^2 \cdot m_0, \ldots, \min(1,R)\}$
(Line 11). Let $\mu''$ be the resulting measure. We also set $\mu''((x,y)) = \mu(x,y) - \sum_i \mu''(p(x,y,i))$
(Line 19). One can observe that $\mu''$ is included in the set of measures enumerated by our algorithm.

We now show that $\mathrm{EMD}(\mu'', \tilde\mu) \le 2R/50$. The cost of $\mathrm{EMD}(\mu'', \tilde\mu)$ comes from two sources:

1. Contribution from $(x,y,i) \in I$ for which $\tilde\mu(p(x,y,i)) < m_0$. Then $\mu''(p(x,y,i)) = 0$. There are at most
$2k$ such $(x,y,i) \in I$. But we can reroute these small masses with cost at most $2\Delta k m_0 \le 0.02R$.

2. Contribution from $(x,y,i) \in I$ for which $\tilde\mu(p(x,y,i)) \ge m_0$. This implies that the value of $\tilde\mu(p(x,y,i))$
is within 1% of $\mu''(p(x,y,i))$. Therefore, the total contribution of such coordinates $(x,y,i) \in I$ is at
most $0.01 \cdot \mathrm{EMD}(\mu, \tilde\mu) \le 0.02R$.

Thus, overall we have $\|\mu' - \mu''\|_{EMD} \le \|\mu' - \tilde\mu\|_{EMD} + \|\tilde\mu - \mu''\|_{EMD} \le (0.02 + 0.02 + 0.02) \cdot R <
R/4$. □

A Proofs from Section 3

Proof of Claim\3f2\ Eet / G Nt be a point such that p{y*,y') < r, (recall that A,- is an r,-net of F). Clearly, 

it is sufficient to prove that / G Si. This is equivalent to the condition p{y,yi-i) < j8r,-. Eet us verify the 

latter: 


P{y,yi-i)<K-{p{y*,y')+p(y*,yi-i)) < A- (r,--hp(/,y/_i)) <K^ ■ {ri + p{x,y*)+p{x,yi-i)) < 
<K^- {ri-\-2p{x,yi-i)) < • (r;-f 2<7(y;_i)) < ■ {riJ-lYn^i) = ■ (a-yly) < ajSr;_i = jSr,-, 
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where the third inequality follows from the definition of y', the fourth inequality follows from the definition 
of y*, the fifth inequality follows from (01), the sixth step follows from the statement of the Claim, and the 
penultimate step follows from ([51). □ 


Proof of Claim \3.3\ First, observe that by (0]) and the fact that the algorithm returns y;_ i we have p {x, i) < 

< YC-i- Second, by (0]) and Claim 

^ <K-{p{x,y*) + p{y*,Si)) <K- {p{x,y*) + n). 

Thus, 

Overall, 

P{x,yi-i) ^ TT-i ^ DKy 
p{x,y*) a{y-DK)' 

□ 


Proof of Claim U^ If the algorithm returns yt, then 

p{x,yL) < yrt = 

where the second step follows from the definition of r,. □ 

B Lower Bound on Measurement Complexity 

We use $a \lesssim b$ to denote that there exists a universal constant $C$ such that $a \le Cb$. We use $a \gtrsim b$ to denote
$b \lesssim a$ and $a \approx b$ to denote $a \lesssim b \lesssim a$.

We work with the linear sparse recovery scheme as in the Introduction (Equation (1)). We set the
probability of error to be $P = 1/4$.

The following lemma generalizes the result in [DIPW10] to general norms and nonnegative inputs.

Lemma B.1. Consider any norm $\|\cdot\|$ over $\mathbb{R}^n$ for which $n^{-c} \le \|x\| / \|x\|_2 \le n^{c}$ for some constant $c$. Further suppose
that there exists a set $X \subset [0,\infty)^n$ of $k$-sparse vectors such that $\|x\| \approx 1$ for all $x \in X$ and $\|x - x'\| \gtrsim 1$ for
all $x \ne x' \in X$. Then any linear sparse recovery scheme with approximation factor $K$ over $[0,\infty)^n$ must use
$m \gtrsim \log|X| / \log K$ linear measurements.

Proof. We first show a set of assumptions we can make without loss of generality, then give an algorithm to
solve augmented indexing using sparse recovery, then analyze the algorithm.

WLOG assumptions and setup. First, we show that we can assume that $x \in X$ have coordinates that
are multiples of $1/n^{c+1}$. Let $x'$ be $x$ rounded to the nearest multiple of $1/n^{c+1}$ in each coordinate, so
$\|x - x'\|_\infty \le 1/n^{c+1}$. Therefore $\|x - x'\|_2 \le 1/n^{c+1/2}$, or $\|x - x'\| \le 1/\sqrt{n}$. This means that replacing $x$ with
$x'$ would also satisfy the conditions with negligibly worse constants and have coordinates that are multiples
of $1/n^{c+1}$.

We would like to give a lower bound for all randomized sparse recovery schemes that work for each input
with 3/4 probability. By Yao's minimax principle, it suffices to give an explicit distribution on inputs for
which no deterministic sparse recovery scheme $(A, \mathcal{A})$ can work with 3/4 probability. Furthermore, we may
assume that $A \in \mathbb{R}^{m \times n}$ has orthonormal rows (otherwise, if $A = U \Sigma V^T$ is its singular value decomposition,
$\Sigma^{+} U^T A$ has this property and the transformation can be inverted before applying the algorithm).

We use the following lemma from [DIPW10]:

Lemma B.2. Consider any $m \times n$ matrix $A$ with orthonormal rows. Let $A'$ be the result of rounding $A$ to $b$
bits per entry. Then for any $v \in \mathbb{R}^n$ there exists an $s \in \mathbb{R}^n$ with $A'v = A(v - s)$ and $\|s\|_1 < n^2 2^{-b} \|v\|_1$.

Proof. Let $A'' = A - A'$ be the roundoff error, so each entry of $A''$ is less than $2^{-b}$. Then for any $v$ and
$s = A^T A'' v$, we have $As = A''v$ and

$\|s\|_1 = \|A^T A'' v\|_1 \le \sqrt{n}\,\|A'' v\|_1 \le m \sqrt{n}\, 2^{-b} \|v\|_1 \le n^2 2^{-b} \|v\|_1$.

□
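
The identity $A'v = A(v - s)$ and the norm bound of Lemma B.2 can be checked numerically on a small random instance. The sketch below is only a sanity check under stated assumptions: the matrix is built by Gram-Schmidt on Gaussian vectors, rounding is truncation to a multiple of $2^{-b}$, and all function names are illustrative.

```python
import math
import random

def orthonormal_rows(m, n, rng):
    """Build an m x n matrix with orthonormal rows via Gram-Schmidt."""
    rows = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def rounding_shift(A, b, v):
    """Round A down to multiples of 2^-b and return (A', s) with s = A^T (A - A') v."""
    scale = 2 ** b
    A_round = [[math.floor(a * scale) / scale for a in row] for row in A]
    A_err = [[a - ar for a, ar in zip(row, rrow)] for row, rrow in zip(A, A_round)]
    s = matvec(transpose(A), matvec(A_err, v))
    return A_round, s

if __name__ == "__main__":
    rng = random.Random(0)
    m, n, b = 3, 8, 10
    A = orthonormal_rows(m, n, rng)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    A_round, s = rounding_shift(A, b, v)
    lhs = matvec(A_round, v)
    rhs = matvec(A, [vi - si for vi, si in zip(v, s)])
    print(max(abs(x - y) for x, y in zip(lhs, rhs)))
```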

Now, let $A'$ be $A$ rounded to $c' \log n$ bits per entry for $c'$ to be chosen later. By Lemma B.2, for any $v$ we
have $A'v = A(v - s)$ for some $s$ with $\|s\|_1 \le n^2 2^{-c' \log n} \|v\|_1$, so

$\|s\| \le n^{2c+2-c'} \|v\|$.

We are now ready to construct the lower bound of $m$ via a reduction from the one-way augmented
indexing problem in communication complexity. In this problem, Alice has a bit string $b$ of length $r \log|X|$
for $r = \log n$, and Bob has an index $i^* \in [r \log|X|]$ as well as $b_1, \ldots, b_{i^*-1}$. Alice must send a message to
Bob, who must output $b_{i^*}$ with 2/3 probability. It is known that the message must contain $\Omega(r \log|X|) =
\Omega(\log n \log|X|)$ bits. We will show a way to use the sparse recovery algorithm to solve augmented indexing
with $O(m \cdot \log n \cdot \log K)$ bits, giving the lower bound of $m \gtrsim \log|X| / \log K$.

Algorithm to solve augmented indexing. Alice turns her $r \log|X|$ bits into a list $x_1, \ldots, x_r \in X$. She then
defines

$z = \sum_{i=1}^{r} x_i / (KC)^i$

for a sufficiently large constant integer $C$ to be specified later, and

$y = A'z$.

Since $\|z\| \le \sum_{i=1}^{r} \|x_i\| / (KC)^i < 1$, we have that $y = A(z - s)$ for some $s$ with $\|s\| \le n^{2c+2-c'}$. Alice then
sends $y$ to Bob.

Transmitting $y$ takes $O(m \cdot \log n \cdot \log K)$ bits. To see this, note that each coordinate of $z$ is a multiple
of $1/(n^{c+1} (KC)^r)$ that is at most $n^{c}$, and each coordinate of $A'$ is a multiple of $1/n^{c'}$ that is at most 1. Hence
each coordinate of $y = A'z$ is a multiple of $1/(n^{c+1+c'} (KC)^r)$ that is at most $n^{c+1}$, which can be represented in
$\log(n^{2c+2+c'} (KC)^r) \lesssim \log n \cdot (c' + \log K)$ bits. There are $m$ coordinates, so transmitting $y$ takes $O(m \cdot \log n \cdot
(c' + \log K))$ bits.

Now, based on his inputs $b_1, \ldots, b_{i^*-1}$ and $i^*$, Bob can figure out $x_1, \ldots, x_{i'-1}$ and wants to figure out
$x_{i'}$ for $i' = 1 + \lfloor i^* / \log|X| \rfloor$. Once he learns $y = A'z = A(z - s)$, Bob chooses $u \in [0, 1/(K n^{c+1})]^n$ uniformly at
random, and computes

$y' = (KC)^{i'} \left( y - A \sum_{i=1}^{i'-1} x_i / (KC)^i \right) + Au$.

Bob then performs sparse recovery using $\mathcal{A}$ on $y'$, getting a result $\hat{x}$. He rounds $\hat{x}$ to the $\tilde{x} \in X$ minimizing
$\|\hat{x} - \tilde{x}\|$. We will show that $\tilde{x} = x_{i'}$ with at least 2/3 probability; if this happens, Bob can recover $b_{i^*}$ from
the associated vector $x_{i'}$.

Analysis of algorithm. We have that $y' = A(z' - s + u)$ for $z', s$ with $\|s\| \le n^{2c+2-c'}$ and

$z' = x_{i'} + \sum_{j=1}^{r-i'} x_{i'+j} / (KC)^j =: x_{i'} + w$

for $w = \sum_{j=1}^{r-i'} x_{i'+j} / (KC)^j$ having $\|w\| \lesssim 1/(KC)$. Then $y' = A(x_{i'} + w + u - s)$.

For now, pretend that Bob performed sparse recovery on $A(x_{i'} + w + u)$ instead of $A(x_{i'} + w + u - s)$. The
distribution of $x_{i'} + w + u$ depends on the distribution of inputs to the augmented indexing problem, but it is
independent of the choice of $A$ and is over $[0,\infty)^n$. Therefore we can choose our $A$ to be a matrix that lets
us perform sparse recovery with 3/4 probability over this distribution. Then the result $\hat{x}$ of sparse recovery
satisfies

$\|\hat{x} - (x_{i'} + w + u)\| \le K \min_{k\text{-sparse } x} \|x_{i'} + w + u - x\| \le K \|w + u\|$    (8)

with 3/4 probability, or

$\|\hat{x} - x_{i'}\| \le (K+1)(\|w\| + \|u\|) \lesssim 1/C$.    (9)

If $C$ is a sufficiently large constant, this is less than

$\min_{x \ne x' \in X} \|x - x'\| / 2 \gtrsim 1$.    (10)

Therefore, when Bob rounds $\hat{x}$ to $X$, he gets $x_{i'}$ whenever sparse recovery succeeds, as happens with 3/4
probability.

In fact, Bob performs sparse recovery on $A(x_{i'} + w + u - s)$, not $A(x_{i'} + w + u)$. However, the latter is
statistically close to the former. In particular, $\|s\|_\infty \le n^{3c+2-c'}$, so that the total variation distance

$TV(u, u - s) \le n \cdot \frac{n^{3c+2-c'}}{1/(K n^{c+1})} \le K n^{4c+4-c'}$.

Setting $c' \ge 4c + 5 + \log_n K$, we get that

$TV(A(x_{i'} + w + u), A(x_{i'} + w + u - s)) \le TV(u, u - s) \le 1/n$.

Therefore Bob's rounding of $\hat{x}$ to $X$ will equal $x_{i'}$ with probability at least $3/4 - O(1/n) > 2/3$. This solves
the augmented indexing problem with only $O(m \log n \cdot (c' + \log K)) = O(m \log n \cdot \log K)$ bits of
communication. Since augmented indexing requires $\Omega(r \log|X|) = \Omega(\log n \log|X|)$ bits of communication in this
setting, we have $m \gtrsim \log|X| / \log K$. □
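
The geometric encoding at the heart of the reduction can be illustrated without the sketching matrix. The following toy sketch is an illustration under simplifying assumptions, not the reduction itself: it ignores $A$, the roundoff $s$, the smoothing noise $u$ and any recovery error, uses the $\ell_1$ norm as a stand-in for the general norm, writes `D` for $KC$, and takes a hypothetical codebook of standard basis vectors.

```python
def encode(blocks, codebook, D):
    """Alice's vector z = sum_i x_i / D^i, where x_i = codebook[blocks[i-1]]."""
    n = len(codebook[0])
    z = [0.0] * n
    for i, b in enumerate(blocks, start=1):
        for t in range(n):
            z[t] += codebook[b][t] / D ** i
    return z

def decode_block(z, known_prefix, codebook, D):
    """Bob's step: subtract the known prefix, rescale, round to the nearest codeword."""
    i_next = len(known_prefix) + 1
    resid = list(z)
    for i, b in enumerate(known_prefix, start=1):
        for t in range(len(resid)):
            resid[t] -= codebook[b][t] / D ** i
    resid = [D ** i_next * r for r in resid]

    def l1_dist(c):
        return sum(abs(a - b) for a, b in zip(resid, c))

    return min(range(len(codebook)), key=lambda j: l1_dist(codebook[j]))

if __name__ == "__main__":
    codebook = [[1.0 if t == j else 0.0 for t in range(4)] for j in range(4)]
    blocks = [2, 0, 3, 1]
    z = encode(blocks, codebook, D=8)
    print([decode_block(z, blocks[:p], codebook, D=8) for p in range(len(blocks))])
```

After rescaling, the residual equals the next codeword plus a tail of norm at most $1/(D-1)$, which is below half the minimum codeword distance, so nearest-codeword rounding succeeds.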

Proof of Theorem 1.2. Define $S = [0,\infty)^n \cap T_k \cap \{x \in \mathbb{R}^n : \|x\| \le 1\}$.

Because the space and the norm are homogeneous, we have by the definition of doubling dimension that
covering $S$ requires $2^d$ balls of radius 1/2. Therefore we can find a packing $X \subset S$ of $2^d$ points such that
$\min_{x \ne x' \in X} \|x - x'\| \ge 1/2$. This also means at most one $x \in X$ has $\|x\| < 1/4$. Throwing this possible element
out, we get a set of size $2^d - 1$ satisfying the constraints of Lemma B.1, giving that $m \gtrsim d / \log K$.


B.l Lower bound for streaming algorithms 

In this section we show a lower bound on the space bit complexity of any streaming algorithm that maintains
an approximately best $k$-sparse approximation of a vector with respect to any norm $\|\cdot\|$ on $\mathbb{R}^n$ such that

$n^{-O(1)} \le \|x\| / \|x\|_2 \le n^{O(1)}$    (11)

for every $x \in \mathbb{R}^n$.

Theorem B.3. Suppose that there is an algorithm that can maintain a vector $x \in \mathbb{R}^n$ under updates of the
form $x_i := x_i + \delta_i$, where $\delta_i \in \mathbb{Z}$, and, moreover, suppose that we are promised that all entries of $x$ at any
moment of time are integers between 0 and $n^{O(1)}$. In the end, the algorithm is required to output a vector $y$
such that

$\|x - y\| \le K \cdot \min_{k\text{-sparse } x^*} \|x - x^*\|$,

where $K \ge 2$ is some approximation factor. Then, the space bit complexity of the algorithm must be at least

$\Omega\left( d \cdot \frac{\log n}{\log K} \right)$,

where $d$ is the doubling dimension of the non-negative $k$-sparse vectors under $\|\cdot\|$.

The rest of this section is devoted to proving this Theorem. We roughly follow the above argument for
proving the lower bound for sparse recovery. However, in this case, the argument is even simpler since we
do not need to handle issues related to the sketching matrix.

First, we take $r = 2^{\Omega(d)}$ non-negative $k$-sparse vectors $v_1, \ldots, v_r$ whose $\|\cdot\|$-norm is $O(1)$ and that are
$\Omega(1)$-pairwise separated wrt $\|\cdot\|$. We will show how Alice and Bob can solve Augmented Indexing on
$b = \Omega(d \cdot \log n / \log K)$-bit strings using the assumed algorithm. Alice partitions her $b$-bit sequence into blocks of
length $\log r = \Omega(d)$, encodes each block in one of the $v_i$'s (denote it by $u_j$ for $1 \le j \le b / \log r$), and then
feeds the (properly rescaled and discretized) vector

$U = \sum_{j=1}^{b / \log r} \frac{u_j}{(CK)^j}$,

where $C > 0$ is a sufficiently large constant, to the algorithm. Bob takes over, starting from this moment,
subtracts the part of $U$ that corresponds to his prefix and then uses the algorithm to recover the next $u_j$.

Overall, we have that the required space is at least $\Omega(b) = \Omega(d \cdot \log n / \log K)$. The only remaining fact we
need to argue about is why the accuracy per entry is sufficient. First, we use (11) to claim that
polynomial in $n$ accuracy is enough to represent the $v_i$'s with the required conditions. Second, since $b / \log r \approx
\log n / \log K$, we get that $U$ can be represented with accuracy polynomial in $n$.


C Additional Bounds on Doubling Dimension for EMD 

C.l Upper Bound for Measures with Bounded Granularity 

Lemma C.1. Consider the set $S$ of all $k$-sparse measures $\rho$ such that, for all coordinates $(x,y)$, $\rho(x,y)$ is equal
to $i/N$ for some non-negative integer $i$ and the total measure of $\rho$ is 1. The set $S$, under EMD, has doubling
dimension $O(k \log\log N)$.

Proof. Let $\mathrm{Ball}_{EMD}(\rho, r)$ be the EMD ball of radius $r$ containing $k$-sparse probability measures of granularity
$1/N$ over the plane, where $\rho \in S$ is the center of the ball. Further down we will denote $\mathrm{Ball}_{EMD}(\rho, 1)$ by $B$.

Case 1. $|\mathrm{supp}(\rho)| = 1$. WLOG, the entire probability mass of $\rho$ is at point $(0,0)$. We can verify that
$\mu \in B$ implies $\mathrm{supp}(\mu) \subseteq [-100N, 100N]^2$.

Eel B' be fhe sef of all probabilify measures p wifh properfies fhaf p has granularily l/n, supp(/t) C 
[—lOOA, lOOA]^, all coordinates of poinfs from supp(/t) are of fhe form for an integer i. We can verify 
fhaf, for every p €B, fhere exisls p' G B' wifh ||/i — ju'|| < < 1/1000. 
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Therefore, if we construct ^-cover X of B' as per Lemma 1X6] of size |X| = X is also 

cover of B and we get the required upper bound. 

There is one issue, though. It might be that the measures from cover X does not have granularity 1 /N. 

To deal with this, we first build ^-cover X of B' according to Lemma 1X61 Then, for every measure B 
from the cover, if /r is not of granularity 1 /N and EMD ball of radius 2^ around B does not contain any 
measure of granularity 1 /N, we discard this measure because it does not cover any measure of interest. If, 
on the other hand, the EMD ball of radius 1 /200 does contain measure jj.' of granularity 1 /N, we replace /r 
with jj.' in the cover. Clearly, increasing the radius by a factor of 2, still covers all the previous points, i.e., 
BallEMD(At, 1/200) C BallEMD(At^ 1/100). 

Case 2. | supp(p)| > 1. We denote elements of supp(p) by {x,y). Because of granularity of measures, if 

HGB, then supp(At) C U(;c,>.)esupp(p)where 

= [a: - 100A^,.r+ lOOA^j x [y - 100A^,y + lOOA^j 
denotes a square in plane with side length 200A^. 

We construct a graph with vertices S(^x,y)^ (-^d) C supp(/7). We connect two vertices if the corresponding 
squares have non-empty intersection. We consider connected components of the resulting graph. We want 
to move the connected components so that, in the end, all of them live inside a square of side length lO^A^^ 
and distance between any two connected components is > 10^A^. We can verify that we can do that. 

Eet p' denote the resulting measure and (x',y') G supp(p') be the resulting elements of the support. We 
round the coordinates of the elements of supp(/7') so that all x' and y' are of the form for some integer 
i. Eet p” be the measure after the rounding. 

We can check that EMD(p',p") < ™iqqo^^ < 1/1000. Therefore, if weconstruct -cover of BallEMD(y^ 1-1), 
we get ^-cover of BallEMD^', 1)- 

Consider all probability measures from BallEMD(p^^ f-2) with the property that all coordinates of ele¬ 
ments of supports of measures have form for some integer i. We denote this set by Ballg]y[Q(/7", 1.2). 
2^-cover for BallEMD(p"> 1-2) gives 5^ + < y^-cover for BallEMob", LI). 

To construct ^-cover for Ball£]y[j5(p", 1.2), we start by constructing ^-cover of Ballg]yjj5(p", 1.2) by 
measures not necessarily having granularity To get measures with granularity jj, we proceed in the 
same way as in Case 1, i.e., we consider 2 cases. If a measure in the ^-cover does not have a measure of 
granularity ^ within EMD distance then discard this measure from the cover. Otherwise, replace the 
measure with the measure that has granularity We can see that the set of measures that these operations 
produce, is 2 • ^ = ^-cover BallEMD(P^^ 1-2) and has granularity Erom Eemma lX^ the size of the 
cover is (logA/^)'^^^/ As a result, we have ^-cover of BallEMD(/^/1)- All measures in the cover have 
granularity 

Given ^-cover of BallEMD(y, 1)> we would like to construct ^-cover of B. Given that p and all 
measures from the cover have granularity jj, we can make the following assumption. The optimal trans¬ 
portation of probability measure from p to every measure from the cover has probability mass on every edge 
of amount jj for some non-negative integer i. As aresult, if is a measure from ^-cover of BallEMD(p^ 1), 
in the optimal transportation of p' into p (that achieves cost EMD(p',/i)), p has non-zero amount on edges 
to elements of supp(/7') that corresponds to at most one component of the graph. (This follows because the 
connected components are highly separated in p'.) This gives that we can move components independently. 

We move the components to their original positions (according in p) and accordingly transform measures in 
the cover. This gives ^-cover for B. □

C.2 Lower bound on the doubling dimension

Lemma C.2. Weighted point-sets over $[\Delta]$ of cardinality $k$ under EMD have doubling dimension $\Omega(k) \cdot
\log(\Omega(\log(\Delta/k)))$ for $k > 1$.

Proof. WLOG, we assume that $\Delta$ is an integer power of 2. By $(x, w)$ we denote a point with coordinate $x$
and weight $w$. Let set $A$ be a weighted point-set of size $k$. For $i = 1, 2, 3, \ldots, k/2$, we set the $i$-th point of $A$ to
be $A_i = (2i\Delta/k, 2)$. The remaining $k/2$ points have weight 0 and arbitrary coordinates on the line.

Let $I = i_1, i_2, \ldots, i_{k/2}$ for $0 \le i_j < \log U$ (we will later set $U = \Delta/k$), and let $B_I$ be a point-set defined as

$B_I = \bigcup_{j=1}^{k/2} \left\{ (2j\Delta/k,\ 2 - 2^{-i_j}),\ (2j\Delta/k + 2^{i_j},\ 2^{-i_j}) \right\}$.

We constructed $B_I$ such that $\mathrm{EMD}(A, B_I) = k/2$ for all $I$.

Consider an EMD ball of radius ^/2 around A, i.e., BallEMD(A,^/2). We will show that the number of 

k.2_ 

EMD balls of radius k/200 needed to cover it is (log /2^, which yields the result. 

Consider a ^-cover of BallEMD(A,^/2). We will show that the size of the cover must be large. Consider 
a weighted pointset B[ for some I = ii,i 2 , •••,4/2- Given that Bj is covered by an element from the cover, 
there must be an element C from the cover with the property that at least 9/10 fraction of intervals 

[2jA/k + 2 'f - 2 'V 10 , 2jA/k + 2'> + 2'> / 10 ] 


(for 7 = 1.. .k/2) contains an element from supp(C). We call that C hits Bj and the set of elements of 
supp(C) that is contained in some interval we call the hitting set of B/. Otherwise, for any C that does not 
satisfy this property, we have 

, . M 1 I k I k 


There are 


EMD(B/,C)> ( 1-- 


|{B/|/ = /i,/ 2 ,..., 4/2 and0<4 <logU for/e [k/2]}\ = (log-- 




pointsets B/ that are covered. 

Consider an element C from the cover. We will show that C can hit at most 2*^ • (log f) ‘ sets B/. 

This will finish fhe proof. 

There are af mosf 2^ subsefs D of supp(C) fhaf can be a hiffing sef for some B/. Every D can be a hiffing 


□ 


sef for af mosf (log f) ^ ^ sefs Bj because |D| > ^ • |. This finishes fhe proof. 

Corollary C.3. Weighted point-sets over [A]^ of cardinality k under EMD has doubling dimension Q.{k) 
log (a{log for k > 1 . 


Proof We wanf fo choose a poinf-sef A and a lof of “highly separated” poinf-sefs B/ similarly as in lC.2l wifh 
EMD(A,B/) = V4. 

For fhaf, we place k/A poinfs wifh non-zero weighl on a line of lengfh A • fo consfrucf A and B/s 
analogously as in Lemma IC21 The difference is fhaf, instead of placing k/2 poinfs, we place k/A poinfs and 
fhaf, instead of having an interval of lengfh A, we have interval of lengfh A • Then we splif poinfs of A 
wifh fheir counferparfs of B/ info s/k/2 consecufive sequences of poinfs each confaining s/k/2 poinfs. We 


puf /-fh sequence in i ■ .^-row of fhe grid. 

We can verify fhaf fhe resulting poinf-sefs safisfy fhe necessary properfies. 


□

D Better Sketches for EMD


D.l Refined analysis of the grid embedding 

In this section we recall the embedding of $\mathrm{EMD}_{[\Delta]^2}$ into $\ell_1$ from [IT03] (building on [Cha02]) and provide
the refined analysis of a variant of it, under the assumption that we are embedding a measure that can be
represented as a difference of two non-negative measures that both sum to one such that one of them is
$k$-sparse.

We state the following simple lemma without a proof.

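Lemma D.1. For any vector $y \colon [\Delta] \to \mathbb{R}_+$, define $\mathrm{CDF}(y) \colon [\Delta] \to \mathbb{R}_+$ by

$\mathrm{CDF}(y)_i = \sum_{j \le i} y_j$.

Then for any $y, y' \colon [\Delta] \to \mathbb{R}_+$ with $\|y\|_1 = \|y'\|_1 = 1$ we have

$\mathrm{EMD}(y, y') = \|\mathrm{CDF}(y) - \mathrm{CDF}(y')\|_1$.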
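
As a small illustration of the lemma, the following sketch computes the one-dimensional EMD through the prefix-sum formula. The function name `emd_1d` is an illustrative choice, and the inputs are assumed to be non-negative lists of equal length and equal total mass.

```python
from itertools import accumulate

def emd_1d(y, y_prime):
    """EMD on a line with unit spacing, computed as the l1 distance between the CDFs."""
    if len(y) != len(y_prime):
        raise ValueError("both measures must live on the same interval")
    cdf_y = list(accumulate(y))
    cdf_y_prime = list(accumulate(y_prime))
    return sum(abs(a - b) for a, b in zip(cdf_y, cdf_y_prime))

if __name__ == "__main__":
    # moving one unit of mass across two positions costs 2
    print(emd_1d([1, 0, 0], [0, 0, 1]))
```

Because the map from $y$ to $\mathrm{CDF}(y)$ is linear, this is an isometric linear embedding of the one-dimensional EMD into $\ell_1$.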


The Cauchy distribution is a continuous probability distribution with the probability density function
$\frac{1}{\pi} \cdot \frac{\gamma}{x^2 + \gamma^2}$, where $\gamma$ is the scale parameter. If not otherwise specified, we will refer to a Cauchy variable
as one which is drawn from the distribution with $\gamma = 1$.

First, we need the following folklore claim that will be useful for us later.


Claim D.2. Let $X_1, X_2, \ldots, X_n$ be (not necessarily independent) non-negative random variables such that
for every $i$ and $t > 0$ we have

$\Pr[X_i \ge t] \le \frac{C}{t}$,

where $C > 0$ is some constant. Suppose that $S = \sum_i \alpha_i X_i$, where $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$. Then, for every $\delta > 0$ we
have

$\Pr[S \le O_{C,\delta}(H(\alpha))] \ge 1 - \delta$,

where $H(\alpha)$ is the entropy of the distribution over $[n]$ defined by $\alpha$. In particular, $H(\alpha) \le \log_2 n$.


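Proof. Let $T_1, T_2, \ldots, T_n$ be non-negative parameters to be chosen later. Denote by $\mathcal{E}$ the event "for every $i$ one
has $X_i \le T_i$". Then, by the union bound,

$\Pr[\neg\mathcal{E}] \le \sum_{i=1}^{n} \frac{C}{T_i}$,

and for every $i$ one has $\mathrm{E}[X_i \mid \mathcal{E}] \le O_C(\log T_i)$. Thus, by Markov's inequality,

$\Pr\left[ S \le O_{C,\delta}\left( \sum_i \alpha_i \log T_i \right) \,\middle|\, \mathcal{E} \right] \ge 1 - \delta/2$.

Thus, we are looking for $T_i$'s such that $\sum_i C/T_i \le \delta/2$ and $\sum_i \alpha_i \log T_i$ is minimized. Via simple calculus, we
obtain the desired inequality. □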
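
The source leaves the final optimization to the reader. One choice of the thresholds that satisfies the constraint (an assumption made here for illustration; other choices differ only in constants) is $T_i = 2C/(\delta \alpha_i)$ for every $i$ with $\alpha_i > 0$:

```latex
\sum_{i} \frac{C}{T_i} = \frac{\delta}{2} \sum_{i} \alpha_i = \frac{\delta}{2},
\qquad
\sum_{i} \alpha_i \log T_i
  = \log\frac{2C}{\delta} \sum_i \alpha_i + \sum_{i} \alpha_i \log\frac{1}{\alpha_i}
  = \log\frac{2C}{\delta} + H(\alpha).
```

With this choice the additive term $\log(2C/\delta)$ depends only on $C$ and $\delta$, and the remaining term is exactly the entropy $H(\alpha)$.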


Let us recall how the embedding from [IT03] of $\mathrm{EMD}_{[\Delta]^2}$ into $\ell_1$ works. For the sake of exposition,
let us assume that $\Delta = 2^l$ for a non-negative integer $l$.

For $s = (s_1, s_2) \in [\Delta]^2$ and $0 \le t \le l$ we define a linear map $G_{s,t} \colon \mathrm{EMD}_{[\Delta]^2} \to \ell_1$ as follows. We first impose a
grid $G_{s,t}$ over $\mathbb{R}^2$ with side length $2^t$ so that one of the corners is located in $s = (s_1, s_2)$. Then, for a measure
$\mu \in \mathrm{EMD}_{[\Delta]^2}$ we define $G_{s,t}\,\mu \in \ell_1$ as follows: for every square of the grid we count the total mass of $\mu$ that is
located there. Then, we define the following (linear) embedding $G_s$ of $\mathrm{EMD}_{[\Delta]^2}$ into $\ell_1$ parametrized by a shift
$s = (s_1, s_2) \in [\Delta]^2$: $G_s\,\mu := \bigoplus_{t=0}^{l} 2^t \cdot G_{s,t}\,\mu$.
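
The construction can be made concrete with a short sketch that evaluates $\|G_s(\mu - \nu)\|_1$ for two sparse measures. It assumes, as in the standard grid embedding, that level $t$ is weighted by its side length $2^t$; the dictionary representation of measures and the function name are illustrative choices.

```python
from collections import defaultdict

def grid_embedding_norm(mu, nu, s, levels):
    """l1 norm of G_s(mu - nu), where level t (t = 0..levels) uses cells of side 2^t
    anchored at the shift s and is weighted by 2^t.

    mu, nu: dicts mapping integer points (x, y) to non-negative mass.
    """
    total = 0.0
    for t in range(levels + 1):
        side = 2 ** t
        cell_mass = defaultdict(float)
        for (x, y), w in mu.items():
            cell_mass[((x - s[0]) // side, (y - s[1]) // side)] += w
        for (x, y), w in nu.items():
            cell_mass[((x - s[0]) // side, (y - s[1]) // side)] -= w
        total += side * sum(abs(v) for v in cell_mass.values())
    return total

if __name__ == "__main__":
    mu = {(0, 0): 1.0}
    nu = {(1, 0): 1.0}
    # the same pair of points is separated at different levels depending on the shift
    print(grid_embedding_norm(mu, nu, s=(0, 0), levels=1))
    print(grid_embedding_norm(mu, nu, s=(1, 0), levels=1))
```

The two calls show why the shift matters: a pair of nearby points pays only at fine levels under one shift, but also at a coarse level when a coarse grid line happens to fall between them.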

In [IT03] the following properties of $G_s$ have been proved.

Theorem D.3 ([IT03]). For every $\mu \in \mathrm{EMD}_{[\Delta]^2}$:

• for every $s = (s_1, s_2) \in [\Delta]^2$ one has $\|\mu\|_{EMD} \le O(1) \cdot \|G_s \mu\|_1$;

• $\mathrm{E}_{s \in [\Delta]^2}\left[ \|G_s \mu\|_1 \right] \le O(\log \Delta) \cdot \|\mu\|_{EMD}$.

Then, concatenating $G_s$ for all $s \in [\Delta]^2$ one obtains a deterministic embedding of $\mathrm{EMD}_{[\Delta]^2}$ into $\ell_1$ with
distortion $O(\log \Delta)$.
Now we turn to the refined analysis of the above embedding. 

Definition D.4. For $x \in \mathbb{R}^2$ and $0 < R \le 2\Delta$ consider an $\ell_1$-ball in the plane $B_{\ell_1}(x, R)$. Suppose that we
sample a shift $s = (s_1, s_2) \in [\Delta]^2$ uniformly at random. Consider the following random variable $A_{x,R}(s)$:

$A_{x,R}(s) := \frac{\min\left( \{2^t : 0 \le t \le l,\ \text{the grid } G_{s,t} \text{ does not cut the ball}\} \cup \{2^{l+1}\} \right)}{R}$.

In words, we are looking for the side length of the finest out of $l + 1$ grids that does not cut the ball of
interest, or $2^{l+1}$, if it does not exist.

Implicit in [IT03] are the following two Lemmas.

Lemma D.5 ([IT03]). There exists $C > 0$ such that for every $x \in \mathbb{R}^2$, $0 < R \le 2\Delta$ and $t > 0$ one has

$\Pr_{s \in [\Delta]^2}\left[ A_{x,R}(s) \ge t \right] \le \frac{C}{t}$.

Proof. One has $A_{x,R}(s) \ge t$ iff the coarsest grid with side length less than $R \cdot t$ (which is $\Theta(R \cdot t)$) cuts the
ball $B_{\ell_1}(x, R)$. It can be easily verified that this probability is $O(1/t)$. □

Lemma D.6 ([IT03]). For every two points $x, y \in [\Delta]^2$ and uniformly random $s = (s_1, s_2) \in [\Delta]^2$ the quantity
$\|G_s(e_x - e_y)\|_1$, where $e_x$ and $e_y$ are basis vectors that correspond to points $x$ and $y$, respectively, is upper
bounded by $O(1) \cdot R \cdot A_{u,R}(s)$ for every $u \in \mathbb{R}^2$ and $0 < R \le 2\Delta$ such that the ball $B_{\ell_1}(u, R)$ contains both $x$
and $y$.

Proof. All grids that are of side length at least $R \cdot A_{u,R}(s)$ do not contribute to $\|G_s(e_x - e_y)\|_1$ by the definition
of $A_{u,R}$. All finer grids contribute towards $\|G_s(e_x - e_y)\|_1$ the geometric series, whose total sum can be upper
bounded by $O(1) \cdot R \cdot A_{u,R}(s)$. □
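
For completeness, the geometric series in this proof can be written out as follows, assuming that level $t$ of $G_s$ is weighted by $2^t$ and using that $\|G_{s,t}(e_x - e_y)\|_1 \le 2$ at every level:

```latex
\|G_s(e_x - e_y)\|_1
  \;\le\; \sum_{t \,:\, 2^t < R \cdot A_{u,R}(s)} 2^t \cdot 2
  \;<\; 2 \cdot 2 \cdot R \cdot A_{u,R}(s)
  \;=\; 4 \, R \cdot A_{u,R}(s).
```

The middle step uses $\sum_{t \le t_0} 2^t < 2^{t_0+1}$ together with $2^{t_0} < R \cdot A_{u,R}(s)$ for the largest contributing level $t_0$.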

Combining Lemma D.5, Lemma D.6 and the triangle inequality, we obtain the following Claim, which
later will be very useful for our refined analysis of the embedding from [IT03]. Basically, we show that we
can upper bound $\|G_s \mu\|_1$ for $\mu \in \mathrm{EMD}_{[\Delta]^2}$ using Claim D.2.

Claim D.7. Suppose that $\mu$ and $\nu$ are two non-negative measures over $[\Delta]^2$ that both sum to one. Assume
that the optimal transportation of $\mu$ to $\nu$ consists of moving mass $w_i$ from the point $x_i \in [\Delta]^2$ to the point
$y_i \in [\Delta]^2$ for $1 \le i \le p$. Let $\{B_j = B_{\ell_1}(u_j, R_j)\}_{j=1}^{q}$ be a collection of $\ell_1$-balls in the plane such that for every
$1 \le i \le p$ there exists $1 \le j^*(i) \le q$ such that both $x_i$ and $y_i$ belong to $B_{j^*(i)}$. For every $1 \le j \le q$ define

$W_j = \sum_{i \,:\, j^*(i) = j} w_i$.

Suppose we sample a shift $s = (s_1, s_2) \in [\Delta]^2$ uniformly at random. Then, the random variable

$\|G_s(\mu - \nu)\|_1 \le \sum_{i=1}^{p} w_i \|G_s(e_{x_i} - e_{y_i})\|_1$


is dominated by $S = \sum_{j=1}^{q} W_j R_j \cdot X_j$ for some non-negative (not necessarily independent) random variables
$X_1, X_2, \ldots, X_q$ such that for every $j$ and $t > 0$ one has

$\Pr[X_j \ge t] \le \frac{C}{t}$

for some absolute constant $C > 0$.

Now applying Claim D.2, we conclude the following.

Claim D.8. Assuming the notation and conditions from Claim D.7, we have

$\Pr_s\left[ \|G_s(\mu - \nu)\|_1 \le O(1) \cdot H(\alpha) \cdot T \right] \ge 0.99$,

where $T = \sum_{j=1}^{q} W_j R_j$ and $\alpha$ is the following distribution over $[q]$: $\alpha_j = W_j R_j / T$.


Now we state two applications of this claim that are our main goal. 

Lemma D.9. Suppose that $\mu$ and $\nu$ are two non-negative measures over $[\Delta]^2$ that both sum to one and, in
addition, $\mu$ has support of size at most $k$ for some $1 \le k \le \Delta^2$. Then,

$\Pr_s\left[ \|G_s(\mu - \nu)\|_1 \le O(\log k + \log\log \Delta) \cdot \|\mu - \nu\|_{EMD} \right] \ge 0.99$.

Proof. Suppose that $\{x_1, x_2, \ldots, x_k\} \subseteq [\Delta]^2$ is the support of $\mu$. Consider the following family of $O(k \log \Delta)$
balls: $\{B_{\ell_1}(x_i, 2^j)\}_{1 \le i \le k,\ 0 \le j \le \log \Delta + 1}$. Consider the optimal transportation of $\mu$ to $\nu$. Every edge of length
$l$ participating in this transportation can be enclosed in one of the balls of radius $O(l)$. Thus, we can apply
Claim D.8 with $T \le O(1) \cdot \|\mu - \nu\|_{EMD}$. It is left to upper bound $H(\alpha)$. In this Lemma we use a crude
bound: namely, that $H(\alpha) \le \log O(k \log \Delta) \le O(\log k + \log\log \Delta)$, since the support of $\alpha$ is of size at most
$O(k \log \Delta)$. □
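
The crude bound used in this proof is simply that the entropy of a distribution is at most the logarithm of its support size. A minimal sketch follows; the function names are illustrative, and `l` is assumed to equal $\log_2 \Delta$ so that there are $k(l+2)$ balls.

```python
import math

def entropy_bits(alpha):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(a * math.log2(a) for a in alpha if a > 0)

def crude_entropy_bound(k, l):
    """log2 of the number of balls B(x_i, 2^j) with 1 <= i <= k and 0 <= j <= l + 1."""
    num_balls = k * (l + 2)
    return math.log2(num_balls)

if __name__ == "__main__":
    k, l = 4, 10
    uniform = [1.0 / (k * (l + 2))] * (k * (l + 2))
    print(entropy_bits(uniform), crude_entropy_bound(k, l))
```

The uniform distribution over all balls attains the bound, which is $\log_2 k + \log_2(l + 2) = O(\log k + \log\log \Delta)$.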

Lemma D.10. Suppose that $\mu$ and $\nu$ are two non-negative measures over $[\Delta]^2$ that both sum to one, and
all the weights of $\mu$ and $\nu$ are multiples of $1/N$, where $N \ge 1$ is some integer. Moreover, assume that $\mu$ is
$k$-sparse for some $1 \le k \le N$. Then,

$\Pr_s\left[ \|G_s(\mu - \nu)\|_1 \le O(\log k + \log\log N) \cdot \|\mu - \nu\|_{EMD} \right] \ge 0.99$.

Proof. The proof is the same as in Lemma D.9, but we need to upper bound $H(\alpha)$ in a slightly fancier way.
Let us recall the definition of $\alpha$. For each of the $O(k \log \Delta)$ balls we compute the total mass transported
over edges that are allocated to this ball and multiply it by the radius of the ball. Since all the masses are
multiples of $1/N$ and for every $j \le \log \Delta$ we have $k$ balls of radius $2^j$, we can reformulate the question of
upper bounding $H(\alpha)$ as follows. Suppose that we have a bin for every $i \in [k]$ and $j \ge 0$. Then, we put $N$
balls into these bins (adversarially). Then, for each bin indexed by $(i, j)$ we multiply the number of balls
there by $2^j$ and then normalize the resulting numbers so that they sum to 1. What is the upper bound on the
entropy of this distribution? We prove that it is $O(\log k + \log\log N)$ as follows. Denote by $j^*$ the largest $j$ such
that there is $i \in [k]$ such that the bin $(i, j)$ is non-empty. Then, the bins with $j < j^* - 100 \log N$ contribute
to the entropy negligibly, since we multiply the number of balls in these bins by $2^j \le 2^{j^*} / N^{100}$. But the
entropy for bins with $j \ge j^* - 100 \log N$ is $\log O(k \log N) = O(\log k + \log\log N)$, since the total number of
these "important" bins is $O(k \log N)$. □

Remark: The terms $O(\log\log \Delta)$ and $O(\log\log N)$ in Lemma D.9 and Lemma D.10 might appear to be
unfortunate artifacts of our analyses. However, one can show that in both cases the bounds for the embedding
from [IT03] are in fact tight. Nevertheless, in the next section we show how to achieve approximation
$O(\log k)$, if we allow embeddings into more complex spaces (that still allow reasonably good sketches).


D.2 Embedding of EMD into the $\ell_1$-sum of the small EMD instances

In this section we provide a refined analysis of the embedding of $\mathrm{EMD}_{[\Delta]^2}$ from [Ind07].

Suppose our goal is to sketch $\mathrm{EMD}_{[\Delta]^2}$, where $\Delta = 2^l$ for some integer $l \ge 0$. Let $0 < t < l$ be a parameter
to be chosen later. Let us impose a randomly shifted hierarchy of nested grids with side lengths $\Delta/2^t$, $\Delta/2^{2t}$,
..., 1 ($O(l/t)$ grids in total). By "randomly shifted" we mean that the coarsest grid has a corner in a
point $s = (s_1, s_2) \in [\Delta]^2$ chosen uniformly at random, and all the finer grids are imposed by subdividing the
cruder ones. Now let us define the sketching procedure. First, we sketch the $\mathrm{EMD}_{[\Delta/2^t]^2}$ instances induced by
the crudest grid recursively (we have $O(2^{2t})$ of these). Second, for each of these instances we remember
the total mass. Now, to estimate EMD, we estimate EMD for the smaller instances, add these estimates,
then compute EMD for the instance induced by the total masses we remembered, multiply it by $\Delta/2^t$ (the
side length of the crudest grid), and add it to the result. This can be seen as a randomized embedding
$f_s \colon \mathrm{EMD}_{[\Delta]^2} \to \ell_1(\mathrm{EMD}_{[O(2^t)]^2})$. In [Ind07] the following properties of $f_s$ are shown:

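Theorem D.11 ([Ind07]). For every $\mu \in \mathrm{EMD}_{[\Delta]^2}$:

• for every $s$, one has $\|\mu\|_{\mathrm{EMD}_{[\Delta]^2}} \le O(1) \cdot \|f_s \mu\|_{\ell_1(\mathrm{EMD}_{[O(2^t)]^2})}$;

• $\mathrm{E}_s\left[ \|f_s \mu\|_{\ell_1(\mathrm{EMD}_{[O(2^t)]^2})} \right] \le O\left( \frac{\log \Delta}{t} \right) \cdot \|\mu\|_{\mathrm{EMD}_{[\Delta]^2}}$.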


In what follows we improve upon the second item in the above theorem under the following additional
assumptions on $\mu$. Namely, suppose we apply $f_s$ for random $s$ to a difference $\nu - \tau$, where $\nu$ and $\tau$ are
non-negative measures over $[\Delta]^2$ that sum to one and $\nu$ is $k$-sparse.

Lemma D.12. If $\nu$ and $\tau$ are as above, then

$\Pr_s\left[ \|f_s(\nu - \tau)\|_{\ell_1(\mathrm{EMD}_{[O(2^t)]^2})} \le O\left( 1 + \frac{\log k + \log\log \Delta}{t} \right) \cdot \|\nu - \tau\|_{\mathrm{EMD}_{[\Delta]^2}} \right] \ge 0.99$.


Proof. As in the proof of Lemma D.9, we cover the edges of the optimal transportation of $\nu$ to $\tau$ with $O(k \log \Delta)$
balls $\{B_j\}$ such that every edge of length $r$ lies within a ball of radius $O(r)$. Define the event $\mathcal{E}$ as follows:
"every ball $B_j$ is not cut by a grid with side length at least the radius of $B_j$ times $C k \log \Delta$". We can choose $C$
such that $\Pr_s[\mathcal{E}] \ge 0.999$ (we can take the union bound over the balls $B_j$ and for every fixed ball we proceed
as in Lemma D.5).

Now let us consider a fixed edge of length $r$ from the optimal transportation. The goal is to argue that,
conditioned on $\mathcal{E}$, the expected contribution of the edge to $\|f_s(\nu - \tau)\|_{\ell_1(\mathrm{EMD}_{[O(2^t)]^2})}$ is at most

$O\left( r \cdot \left( 1 + \frac{\log k + \log\log \Delta}{t} \right) \right)$.

Then we will be done by the triangle inequality, Markov's inequality and the fact that $\Pr_s[\mathcal{E}] \ge 0.999$.

Let us argue about the contribution of the edge for every grid separately. First, all grids with side length
less than $r/10$ contribute at most $O(r)$ in total, because the endpoints end up in different subproblems, and

thus the contribution is proportional to the side length. The side lengths accumulate as geometric series, so 
we have that the sum is 0{r) in total. 

Grids with side length at least $C' \cdot r \cdot k\log\Delta$ (with $C'$ being large enough) do not contribute anything,
conditioned on $\mathcal{E}$.

Grids with side lengths between $r/10$ and $C' \cdot r \cdot k\log\Delta$ contribute in expectation $O(r)$ each (see Lemma 3.3
in [Ind07]). Conditioning on $\mathcal{E}$ can change the expectation by at most a constant factor, since $\Pr_s[\mathcal{E}] \ge 0.999$.
Since we have $O((\log k + \log\log\Delta)/t)$ such grids, the required bound follows. □
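
The per-grid bound used in the proof can be checked in a one-dimensional toy model. The sketch below is an illustration only and is not the construction from [Ind07]: it assumes integer coordinates, a grid of side length `side` shifted by a uniformly random integer in `{0, ..., side - 1}`, and it charges a cut edge the side length of the grid. The function names are hypothetical.

```python
def count_cutting_shifts(a: int, r: int, side: int) -> int:
    """Count integer shifts s in [0, side) for which the points a and a + r
    land in different cells, where a point p lies in cell (p + s) // side."""
    cut = 0
    for s in range(side):
        if (a + s) // side != (a + r + s) // side:
            cut += 1
    return cut


def expected_contribution(a: int, r: int, side: int) -> float:
    """Expected charge of the edge (a, a + r) when a cut edge pays `side`
    and the shift is uniform over the `side` possible integer values."""
    return side * count_cutting_shifts(a, r, side) / side
```

In this toy model, for an edge of length `r` no larger than the side length, exactly `r` of the shifts cut the edge, so the expected charge equals `r` whatever the side length is. This is the same flavor of statement as the $O(r)$ expected contribution per grid quoted from [Ind07].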

D.3 Implications for sketching of EMD 

Theorem D.13. One can linearly sketch $\mathrm{EMD}_{[\Delta]^2}$ for measures that are differences of two non-negative
measures that sum to 1, one of which is $k$-sparse, as follows:

• with sketch size $O(1)$ and approximation $O(\log k + \log\log\Delta)$;

• with sketch size $O(\log^{\delta} \Delta)$ and approximation $O(\log k)$ for every constant $0 < \delta < 1$.

• Moreover, if both measures have all the weights being multiples of $1/N$, where $N$ is a positive integer,
then the first of the results can be improved to having approximation $O(\log k + \log\log N)$.

Proof. The first result follows from composing the first item of Theorem D.11 and Lemma D.12 with a sketch
for $\ell_1$ from [Ind06]. The third result is similar, except we use Lemma D.10.

As for the second result, the starting point is the first item of Theorem D.11 together with our Lemma D.12.
Let us set $t = \delta \log\log\Delta$. This way, we get a randomized embedding of $\mathrm{EMD}_{[\Delta]^2}$ into $\ell_1\bigl(\mathrm{EMD}_{[O(\log^{\delta}\Delta)]^2}\bigr)$
with distortion $O(\log k)$. Then, we apply the result of Verbin and Zhang [VZ12] to perform dimension
reduction. Namely, we need to apply their randomized map twice to reduce the dimension to $O(\log\log\Delta)$. As
a result, we get a sketch of size $O(\log^{O(\delta)} \Delta)$ and distortion $O(\log k)$, if $\delta$ is a (small) positive constant. □

Theorem D.14. One can linearly sketch $\mathrm{EMD}_{[\Delta]}$ over the interval $[\Delta]$ for measures that are differences of two
non-negative measures that sum to 1, one of which is $k$-sparse. We can achieve sketch size $O(1/\varepsilon^2)$ and
approximation $1 + \varepsilon$.

Proof. Using Lemma D.1 we can isometrically embed EMD over the interval $[\Delta]$ into $\ell_1$. Now we can
sketch $\ell_1$ using the sketch from [Ind06]. This gives sketch size $O(1/\varepsilon^2)$ and approximation $1 + \varepsilon$. □
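
One standard way to embed one-dimensional EMD isometrically into $\ell_1$ is through prefix sums: for a signed measure of total mass zero on the points $0, \dots, n-1$, the transportation cost equals the $\ell_1$ norm of its vector of prefix sums. Whether the lemma cited in the proof uses exactly this map is an assumption here; the function names below are hypothetical.

```python
from itertools import accumulate


def embed_1d(mu):
    """Linear map sending a signed measure on points 0..n-1 to its prefix sums."""
    return list(accumulate(mu))


def emd_1d(mu):
    """EMD norm of a zero-sum signed measure on a line, computed as the
    l1 norm of the prefix-sum embedding."""
    assert abs(sum(mu)) < 1e-9, "the measure must have total mass zero"
    return sum(abs(p) for p in embed_1d(mu))
```

For example, `emd_1d([1, 0, 0, -1])` accounts for moving one unit of mass across three unit-length gaps. Because `embed_1d` is linear, it can be composed with any linear sketch for $\ell_1$.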

E Sparse recovery for EMD 

The following three theorems follow from Lemma ITT] and Theorem D.13.

Theorem E.1. There is a linear sketching scheme of probability distributions over $[\Delta]^2$ with the following
guarantees. The size of the sketch is

O (k(log log A) log (log k + log log A) + log log (A/A)) 

and, given a sketch ofx, we can recover x* such that 

EMD(x,x*) < max((9(logk + loglogA) min EMD(x,y),A). 

k - sparse x' 


in time polynomial in A and log^^*^^ A. 
Proof. Lemma lX6] gives that the doubling dimension of $k$-sparse probability measures over $[\Delta]^2$ is $O(k \log\log\Delta)$.
Combining this with Lemma [XT] and the first result from Theorem D.13 we get the stated guarantees. □

Theorem E.2. There is a linear sketching scheme of probability distributions over $[\Delta]^2$ with the following
guarantees. The size of the sketch is

0(1) (log^ A) (log log A) log log k + log log (A/A)) 

for some constant 5 > 0. Given a sketch ofx, we can recover x* such that 

EMD(.r,.r*) < max(0(logA:) min EMD(x,x^), A). 

k - sparse x' 


in time polynomial in A and log^^*^^ A. 

Proof. Lemma lX6] gives that the doubling dimension of $k$-sparse probability measures over $[\Delta]^2$ is $O(k \log\log\Delta)$.
Combining this with Lemma [XT] and the second result from Theorem D.13 we get the stated guarantees. □

Theorem E.3. Let $N$ be a positive integer. There is a linear sketching scheme of probability measures that
have granularity $1/N$. The size of the sketch is

0(^(log log A^) log (log k + log log A^) + log log (A/A)) 

and, given a sketch ofx, we can recover x* such that 

EMD(x,x*) < max(0(log^ + loglogA^) min EMD(x,x'), A). 

k - sparse x' 

in time polynomial in A and log^^^^ A. A is the upper bound on PMD{x,y) for the starting k-sparse approx¬ 
imation y ofx. 

Proof. Lemma ICT] gives that the doubling dimension of $k$-sparse probability measures with granularity $1/N$
is $O(k \log\log N)$. Combining this with Lemma [XT] and the third result from Theorem D.13 we get the stated
guarantees. □

Theorem E.4. There is a linear sketching scheme of probability distributions over the interval $[\Delta]$ with the
following guarantees. The size of the sketch is

O (1 /£^) (^(log log A) log ^ + log log (A/A)) 

and, given a sketch ofx, we can recover x* such that 

EMD(x,x*) < max((l + e) min EMD(x,y), A). 

k - sparse x' 


in time polynomial in A and log^^*^^ A. 

Proof. Lemma lX6] also gives that the doubling dimension of $k$-sparse probability measures over the interval $[\Delta]$
is $O(k \log\log\Delta)$. Combining this with Lemma [XT] and Theorem D.14 we get the stated guarantees. □
E.1 Lower Bounds for Sparse Recovery for Earth Mover’s Distance

Lemma IC^ and Lemma B.1 give the following two theorems, E.5 and E.6.

Theorem E.5. Any linear sparse recovery scheme with approximation factor K with respect to EMD over 
interval [A] requires 

Q.{k)log (a(logf)) 

--i-E- 

logi^ 

measurements for sparsity k> 1. 

We want to compare guarantees of Theorem | e 3] with the lower bound that we achieve in Theorem lE.51 
Theorem IE.4I and assumptions that £ is a constant and A > gives approximation guarantee 


EMD(x,x*) < max(0(l) min EMD(x,x'), A) (12) 

k - sparse x' 


with $O(k \log\log\Delta)$ measurements.

Theorem IE. 5 1 and assumption that ^ < A'^'^ for some constant c > 0 give lower bound (A: log log A) on 
the number of measurements for constant approximation factor. However, this lower bound holds for the 
case when A is equal to 0 in guarantee [T2j 

Erom the proof of Eemma lBT] (equations (HI), dH) and (fTOll ) and Eemma lC^ twe construct ^-cover for 
EMD ball of radius k/2) we see that we are actually good as long as A is sufficiently small. As long as 
A < A for some large constant C. Therefore, our lower bound holds if > A > . 

We see that the upper bound and the lower bound match for the described range of parameters. 


Theorem E.6. Any linear sparse recovery scheme with approximation factor K with respect to EMD over 
square [A]^ requires 


m > 


D(^)log (^D(log^)^ 
logic 


measurements for sparsity k> 1. 


We want to compare guarantees of Theorem IE. 1 1 with the lower bound that we achieve in Theorem IE.6I 
Theorem IE. H and assumptions that £ is a constant and A > gives approximation guarantee 


EMD(x,x*) < max(0(log^ + loglogA) min EMD(x,y),A) (13) 

k - sparse x' 


with $O\bigl(k (\log\log\Delta) \log(\log k + \log\log\Delta)\bigr)$ measurements.

Theorem lE.6l and assumption that ^ < A^^'^ for some constant c > 0 give lower bound 1) 
on the number of measurements for approximation factor 0(log^ + loglogA). However, this lower bound 
holds for the case when A is equal to 0 in guarantee [13] 

Erom the proof of Eemma IB . 1 1 (equations dH), (jH) and (fTOll ) and Corollary 1C. 31 (we construct 2 ^-cover 
for EMD ball of radius k/2) we see that we are actually good as long as A is sufficiently small. As long as 
A < for some large constant C. Therefore, our lower bound holds if > A > 

We see that the upper bound and the lower bound match up to a factor of $\log^2(\log k + \log\log\Delta)$ for the
described range of parameters.


F Sketching of 1 -Median 

For a vector $x \in \mathbb{R}^n$, we use $\|x\|_{\mathrm{med}}$ to denote the median over $i \in [n]$ of $|x_i|$.
F.1 Subspace embeddings

Lemma F.1. Let $L$ be a $d$-dimensional subspace of $\mathbb{R}^n$. Let $A \in \mathbb{R}^{m \times n}$ be a matrix with $m = O\bigl(\frac{1}{\varepsilon^2} d \log\frac{d}{\varepsilon\delta}\bigr)$
and i.i.d. Cauchy entries with scale parameter $\gamma = 1$. With $1 - \delta$ probability, for all $x \in L$ we have

\[
(1-\varepsilon)\|x\|_1 \le \|Ax\|_{\mathrm{med}} \le (1+\varepsilon)\|x\|_1.
\]

Proof. In an abuse of notation, let L be an orthonormal basis for the subspace L. For any threshold T = 
poly(^), the probability that any entry of AL has absolute value larger than T is Oipfdix), using that the i\ 
norms of the columns of L is s/d. Setting T = 0{d^'^/8), we have that every entry of AL is at most T with 
probability \ — 8j2. Suppose this happens. 

Then for all $x \in \mathbb{R}^d$ we have that $\|Lx\|_1 \ge \|Lx\|_2 = \|x\|_2 \ge \|x\|_1/\sqrt{d}$ and $\|ALx\|_\infty \le T\|x\|_1 \le T\sqrt{d}\,\|Lx\|_1$.
Thus for all $y \in L$ we have

\[
\|Ay\|_\infty \le (T\sqrt{d})\,\|y\|_1.
\]

Let t' = d^ 18. 

We construct an $\varepsilon/t'$-net $\mathcal{T}$ in the $\ell_1$ norm for the unit $\ell_1$ ball intersected with $L$, which has size at most
$(1 + 2t'/\varepsilon)^{d}$ by the standard volume argument.

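For any $x \in \mathbb{R}^n$, we say $Ax$ is “good” if only a $\frac12 - C_2\varepsilon$ fraction of coordinates are too large or too small,
i.e.

\[
|\{i : |(Ax)_i| < (1-\varepsilon)\|x\|_1\}| \le (\tfrac12 - C_2\varepsilon)\, m
\]
\[
|\{i : |(Ax)_i| > (1+\varepsilon)\|x\|_1\}| \le (\tfrac12 - C_2\varepsilon)\, m
\]

for some small constant $C_2$. If $Ax$ is “good”, then for any $y$ with at most $C_2 \varepsilon m$ coordinates larger than $\varepsilon\|x\|_1$,
we have

\[
(1-2\varepsilon)\|x\|_1 \le \|Ax + y\|_{\mathrm{med}} \le (1+2\varepsilon)\|x\|_1. \tag{14}
\]

Because $(Ax)_i$ is a Cauchy variable with scale $\|x\|_1$, we have that

\[
\Pr\bigl[|(Ax)_i| < (1-\varepsilon)\|x\|_1\bigr] < 1/2 - \Omega(\varepsilon)
\]
\[
\Pr\bigl[|(Ax)_i| > (1+\varepsilon)\|x\|_1\bigr] < 1/2 - \Omega(\varepsilon).
\]

By a Chernoff bound, for sufficiently small $C_2$ we have that $Ax$ is “good” with all but $e^{-\Omega(\varepsilon^2 m)}$ probability.

For our choice of $m$, we can union bound to have that $Ax$ is “good” for all $x \in \mathcal{T}$ with all but $|\mathcal{T}|\, e^{-\Omega(\varepsilon^2 m)} \le \delta/2$
probability.

Every $y \in L$ with $\|y\|_1 = 1$ can be expressed as $x + z$ for $x \in \mathcal{T}$ and $\|z\|_1 \le \varepsilon/t'$. We have that $Ax$ is
“good” and that $\|Az\|_\infty \le t'\|z\|_1 \le \varepsilon$. Hence by (14),

\[
(1-2\varepsilon)\|x\|_1 \le \|Ay\|_{\mathrm{med}} \le (1+2\varepsilon)\|x\|_1,
\]

which implies

\[
(1-3\varepsilon)\|y\|_1 \le \|Ay\|_{\mathrm{med}} \le (1+3\varepsilon)\|y\|_1.
\]

Since $A$ is linear, the restriction to $\|y\|_1 = 1$ is unnecessary; rescaling $\varepsilon$ then gives the result. □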

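Corollary F.2. Let $A$ have $O(d \log(d/(\varepsilon\delta))/\varepsilon^2)$ rows and Cauchy entries with scale $\gamma = 1$. For any subspace
$L$ of dimension $d$ and subset $S \subseteq L$, with $1 - \delta$ probability we have that

\[
\hat{x} := \operatorname{argmin}_{x \in S} \|Ax\|_{\mathrm{med}}
\]

satisfies

\[
\|\hat{x}\|_1 \le (1 + \varepsilon) \min_{x \in S} \|x\|_1.
\]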
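
The median estimator behind Lemma F.1 and Corollary F.2 can be illustrated for a single fixed vector. This is a minimal sketch under stated assumptions: it checks one vector rather than a whole subspace, it samples standard Cauchy entries by the inverse-CDF method, and the function names and the choice of `m` are hypothetical rather than taken from the text.

```python
import math
import random
import statistics


def cauchy_sketch(x, m, seed=0):
    """Return A x, where A is an m x len(x) matrix of i.i.d. standard
    Cauchy entries generated from a seeded pseudo-random source."""
    rng = random.Random(seed)
    sketch = []
    for _ in range(m):
        # tan(pi * (u - 1/2)) with u uniform in [0, 1) is standard Cauchy.
        row = [math.tan(math.pi * (rng.random() - 0.5)) for _ in x]
        sketch.append(sum(a * xi for a, xi in zip(row, x)))
    return sketch


def median_norm(y):
    """The median of the absolute values of the coordinates of y."""
    return statistics.median(abs(v) for v in y)
```

Each coordinate of the sketch is a Cauchy variable with scale $\|x\|_1$, and the median of the absolute value of such a variable is exactly its scale, so `median_norm(cauchy_sketch(x, m))` concentrates around $\|x\|_1$ as `m` grows. The sketch is linear in `x`, which is what allows the minimization over a set $S$ to be carried out on sketched vectors.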
F.2 1-median in d dimensions

F.2.1 1-median in 1 dimension 

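Theorem F.3. We can find a $(1+\varepsilon)$-approximation to the 1-median in 1 dimension using $O(\log(1/\varepsilon)/\varepsilon^2)$
linear measurements and $\exp(\mathrm{poly}(1/\varepsilon))$ time.

Proof. Define $B \in \mathbb{R}^{n \times 2}$ by $B_{i,1} = 1$ and $B_{i,2} = i$ for all $i \in [n]$. For any $x \in \mathbb{R}^n$, define $D_x \in \mathbb{R}^{n \times n}$ to be the
diagonal matrix with $D_{i,i} = x_i$. Then for any $j \in [n]$ and $z = (j, -1)$, we have that

\[
\|D_x B z\|_1 = \sum_i |x_i| \cdot |i - j|
\]

is the cost of using $j$ as the median for $x$.

Let $A \in \mathbb{R}^{m \times n}$ for $m = O(\log(1/\varepsilon)/\varepsilon^2)$ have i.i.d. Cauchy entries. Then $A D_x B \in \mathbb{R}^{m \times 2}$ consists of $2m$
linear measurements of $x$.

Furthermore, the set $S = \{D_x B z \mid z_2 = -1\}$ is a subset of a 2-dimensional subspace. Hence, by
Corollary F.2,

\[
\hat{z} = \operatorname{argmin}_{z \in \mathbb{R}^2,\; z_2 = -1} \|A D_x B z\|_{\mathrm{med}}
\]

satisfies

\[
\|D_x B \hat{z}\|_1 \le (1+\varepsilon) \min_{z_2 = -1} \|D_x B z\|_1 = (1+\varepsilon)\,\mathrm{cost}(x).
\]

Given $A D_x B$ we can compute $\hat{z}$, from which we recover $\hat{z}_1$ as a $(1+\varepsilon)$ approximation to the 1-median. □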
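
The identity between the matrix expression and the 1-median cost can be checked directly on small inputs. The sketch below assumes the convention that row $i$ of $B$ is $(1, i)$ and that $z = (j, -1)$, which is the reading consistent with the constraint $z_2 = -1$; it works with exact costs and omits the Cauchy sketch. The function names are hypothetical.

```python
def cost_via_matrices(x, j):
    """Compute ||D_x B z||_1 for z = (j, -1), with row i of B equal to (1, i)
    and D_x the diagonal matrix holding the entries of x (i runs from 1 to n)."""
    n = len(x)
    z = (j, -1)
    # (B z)_i = 1 * z_1 + i * z_2 = j - i
    bz = [1 * z[0] + i * z[1] for i in range(1, n + 1)]
    # D_x scales entry i by x_i
    return sum(abs(xi * v) for xi, v in zip(x, bz))


def cost_direct(x, j):
    """Cost of using position j as the median: sum_i |x_i| * |i - j|."""
    return sum(abs(xi) * abs(i - j) for i, xi in enumerate(x, start=1))


def best_median(x):
    """Exhaustively pick the position j in 1..n with the smallest cost."""
    return min(range(1, len(x) + 1), key=lambda j: cost_via_matrices(x, j))
```

Since the cost is a norm of a vector that depends linearly on $z$, minimizing over $z$ with $z_2 = -1$ is a minimization over a subset of a 2-dimensional subspace, which is the situation Corollary F.2 addresses.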

F.2.2 1-median in d dimensions

Claim F.4 (Dvoretzky’s Theorem BDvohOll). Let $G \in \mathbb{R}^{m \times d}$ have suitably scaled i.i.d. Gaussian entries, for
$m = O(d/\varepsilon^2)$. Then with high probability, for all $x \in \mathbb{R}^d$ we have

\[
\|Gx\|_1 \le \|x\|_2 \le (1+\varepsilon)\|Gx\|_1.
\]
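
The phenomenon behind the claim can be observed numerically for one fixed vector. The sketch below is only an illustration under stated assumptions: it uses the scaling $\sqrt{\pi/2}/m$, chosen so that the expectation of the estimate equals $\|x\|_2$ (the claim itself uses a slightly different scaling to obtain a one-sided bound), and it checks a single vector rather than all of $\mathbb{R}^d$. The function name is hypothetical.

```python
import math
import random


def gaussian_l1_estimate(x, m, seed=0):
    """Return ||G x||_1 for an m x len(x) matrix G whose entries are i.i.d.
    N(0, 1) scaled by sqrt(pi / 2) / m, so that the result has mean ||x||_2."""
    rng = random.Random(seed)
    scale = math.sqrt(math.pi / 2) / m
    total = 0.0
    for _ in range(m):
        # <g, x> is Gaussian with standard deviation ||x||_2
        total += abs(sum(rng.gauss(0.0, 1.0) * xi for xi in x))
    return scale * total
```

Each inner product is a centered Gaussian with standard deviation $\|x\|_2$, whose expected absolute value is $\sqrt{2/\pi}\,\|x\|_2$, so the average of `m` such terms concentrates around $\|x\|_2$ after rescaling.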

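Theorem F.5. We can find a $(1+\varepsilon)$-approximation to the Euclidean 1-median in $d$ dimensions using $O(d^2 \log(d/\varepsilon)/\varepsilon^2)$
linear measurements and $\exp(\mathrm{poly}(d/\varepsilon))$ time.

Proof. Let $G \in \mathbb{R}^{t \times d}$ for $t = O(d/\varepsilon^2)$ satisfy Claim F.4, so

\[
\|Gp\|_1 \le \|p\|_2 \le (1+\varepsilon)\|Gp\|_1
\]

for all $p \in [n]^d$. For each point $p \in [n]^d$, define the matrix $B^{(p)} \in \mathbb{R}^{t \times (t+1)}$ by the first $t$ columns being the
identity matrix and column $t+1$ being $Gp$.

Define $G' \in \mathbb{R}^{(t+1) \times (d+1)}$ to equal $G$ over the first $t \times d$ submatrix, $G'_{t+1,d+1} = 1$, and zero elsewhere. For
any point $p \in [n]^d$ define $z^{(p)} \in \mathbb{R}^{d+1}$ by $z_i = p_i$ for $i \le d$ and $z_{d+1} = -1$. For any $p, q \in [n]^d$, we have

\[
B^{(q)} G' z^{(p)} = \begin{pmatrix} I & Gq \end{pmatrix} \begin{pmatrix} G & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} p \\ -1 \end{pmatrix} = Gp - Gq.
\]

Hence

\[
\|B^{(q)} G' z^{(p)}\|_1 = \|Gp - Gq\|_1 \le \|p - q\|_2 \le (1+\varepsilon)\|B^{(q)} G' z^{(p)}\|_1.
\]

For $x \in \mathbb{R}^{n^d}$, define $C_x$ to be the concatenation of the matrices $x_q B^{(q)}$ for all $q \in [n]^d$. Then for
all $x \in \mathbb{R}^{n^d}$ and $p \in [n]^d$, therefore,

\[
\|C_x G' z^{(p)}\|_1 \le \mathrm{cost}(x, p) \le (1+\varepsilon)\|C_x G' z^{(p)}\|_1. \tag{15}
\]

Let $A \in \mathbb{R}^{m \times t n^d}$ for $m = O(d \log(d/\varepsilon)/\varepsilon^2)$ have i.i.d. Cauchy entries. Our method observes

\[
A C_x G' \in \mathbb{R}^{m \times (d+1)},
\]

which is a set of $m(d+1) = O(d^2 \log(d/\varepsilon)/\varepsilon^2)$ linear measurements of $x$.
By Corollary F.2, with good probability we have that

\[
\hat{z} = \operatorname{argmin}_{z \in \mathbb{R}^{d+1},\; z_{d+1} = -1} \|A C_x G' z\|_{\mathrm{med}}
\]

satisfies

\[
\|C_x G' \hat{z}\|_1 \le (1+\varepsilon) \min_{z \in \mathbb{R}^{d+1},\; z_{d+1} = -1} \|C_x G' z\|_1 \le (1+\varepsilon)\,\mathrm{cost}(x).
\]

Hence by (15), for $p = (\hat{z}_1, \dots, \hat{z}_d)$,

\[
\mathrm{cost}(x, p) \le (1+\varepsilon)^2\, \mathrm{cost}(x).
\]

Given $A C_x G'$ we can compute $\hat{z}$, from which we get $p$ as a $(1+\varepsilon)$ approximation to the 1-median after rescaling $\varepsilon$. □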


G Lower bounds for k-means

In this section we prove lower bounds for sketching and streaming k-means. 

First, one can extend the definition of EMD to the sum of squares of distances. Let us denote the
corresponding “distance” $\mathrm{EMD}^2$. It is immediate to see that the space of measures equipped with $\mathrm{EMD}^2$ is a 2-quasi-metric
space. Sparse recovery with respect to $\mathrm{EMD}^2$ is equivalent to the $k$-means clustering.

Second, observe that the construction from Section lC^l can be translated verbatim to $\mathrm{EMD}^2$ to show that
the doubling dimension of the latter is $\Omega(k \cdot \log\log\Delta)$ as well.

Finally, observe that the results of Section B can be applied to $\mathrm{EMD}^2$ as well. Indeed, $\mathrm{EMD}^2$ enjoys the
polynomial aspect ratio and relaxed triangle inequality, and these two happen to be enough for the argument
to go through.

As a result, we get the lower bound $\Omega(k \cdot \log\log\Delta / \log K)$ on the number of measurements necessary
for the linear sketching of $k$-means with approximation $K$.

Alternatively, we can consider the streaming model and reuse the proof from Section B to show that
streaming $k$-means with approximation $K$ requires

/ k A^ \ 


bits. 